0.1 Step 1 : preliminaries

Load some libraries. Mostly the same as in original script except for:

library("ggplot2")
library("kableExtra") # This is just for rendering this html. 
library("tinytex")
library("dada2")
library("ShortRead")
library("Biostrings")
library("data.table")
library("biomformat")
library("DECIPHER")
library("phyloseq")
library("msa")
library("kmer")
library("vegan")
library("compositions")
library("htmltools")
library("foreach")
library("doParallel")

First I generate two objects that will harbour all intermediate steps and auxiliary variables for each dataset. I also generate a list for all custom functions.

setwd("/home/rik/fukushima/metabarcoding.git")
XITS <- list()
X16S <- list()
FUNS <- list()
source("functions.R")

Read in the data paths. Obviously adapted to my computer.

XITS$path <-"~/fukushima/metabarcoding.git/HN00198716/ITS"
X16S$path <-"~/fukushima/metabarcoding.git/HN00198716/16S"

XITS$fnFs <- sort(list.files(XITS$path, pattern = "_1.fq.gz", full.names = TRUE))
length(XITS$fnFs)
XITS$fnRs <- sort(list.files(XITS$path, pattern = "_2.fq.gz", full.names = TRUE))
length(XITS$fnRs)

X16S$fnFs <- sort(list.files(X16S$path, pattern = "_1.fq.gz", full.names = TRUE))
length(X16S$fnFs)
X16S$fnRs <- sort(list.files(X16S$path, pattern = "_2.fq.gz", full.names = TRUE))
length(X16S$fnRs)

# Extract all sample names:
XITS$sample.names    <- unname(sapply(XITS$fnFs, FUNS$get.sample.name))
XITS$sample.names[XITS$sample.names == "mock"] <- c("mockITS","mockITS1")
X16S$sample.names    <- unname(sapply(X16S$fnFs, FUNS$get.sample.name))
X16S$sample.names[X16S$sample.names == "mock"] <- c("mock16S","mock16S1")
sample.names         <- unique(c(XITS$sample.names,X16S$sample.names))


META           <- read.table("~/fukushima/metabarcoding.git/metadata.csv",sep=",",dec=".",header=T)[,1:7]
rownames(META) <- META$library
# Here I define 5 colors that we will use throughout for site, used for the levels 
# "Arakawa", "Futaba", "Okuma", "Tsushima" and "University" 
# I will relevel META to put them in a more logical order as well. 
META$site <- factor(META$site,levels=c("University", "Arakawa", "Tsushima", "Okuma", "Futaba"))
sitecolors     <- c("#ADD8E6","#2E8B57","#FF7F50","#FF91A4","#800020")

XITS$META <- META[match(XITS$sample.names, META$library), ]
X16S$META <- META[match(X16S$sample.names, META$library), ]

0.2 Step 2: initial quality checks

Here I advise to plot quality profiles of the same run, FW and RV next to each other. I also analyzed all files with fastQC, and compiled them in a summary report using multiQC. All these summary reports show more or less the same statistics, so it’s up to personal preference which one you prefer to use. But you should actually go through a round of thorough checking for all libraries.

# This is the loop that you can use to generate all 46 quality profiles:
#for (i in 1:46){plotQualityProfile(c(fnFs[i],fnRs[i]))}

# Example: 
plotQualityProfile(c(XITS$fnFs[16],XITS$fnRs[16]))

plotQualityProfile(c(X16S$fnFs[16],X16S$fnRs[16]))

0.3 Step 3 : getting rid of primers and N base calls

In all cases, >99% of sequences are retained. Question: when were the PhiX reads filtered out?

Primers are present and already confirmed by fastQC. At first sight, there seems to be no need to turn them in all directions, but I agree it’s always better safe than sorry. So we do check for primers in all orientations.

XITS$FWD <- "GTGARTCATCGARTCTTTGAA"
XITS$REV <- "TCCTCCGCTTATTGATATGC" 

X16S$FWD <- "CCTACGGGNGGCWGCAG"
X16S$REV <- "GACTACHVGGGTATCTAATCC" 

XITS$FWD.orients <- FUNS$allOrients(XITS$FWD)
XITS$REV.orients <- FUNS$allOrients(XITS$REV)

X16S$FWD.orients <- FUNS$allOrients(X16S$FWD)
X16S$REV.orients <- FUNS$allOrients(X16S$REV)

XITS$fnFs.filtN  <- file.path(XITS$path, "filtN", basename(XITS$fnFs)) 
XITS$fnRs.filtN  <- file.path(XITS$path, "filtN", basename(XITS$fnRs))

X16S$fnFs.filtN  <- file.path(X16S$path, "filtN", basename(X16S$fnFs)) 
X16S$fnRs.filtN  <- file.path(X16S$path, "filtN", basename(X16S$fnRs))


# Our first filtering step using the filterAndTrim function is really just getting rid of sequences containing N nucleotides, and PhiX. 
# About the multithread parameter: only use this if your computer allows parallel processing. 
XITS$out <- filterAndTrim(XITS$fnFs, XITS$fnFs.filtN, XITS$fnRs, XITS$fnRs.filtN, maxN = 0, matchIDs=T, rm.phix=T, multithread = 12)
X16S$out <- filterAndTrim(X16S$fnFs, X16S$fnFs.filtN, X16S$fnRs, X16S$fnRs.filtN, maxN = 0, matchIDs=T, rm.phix=T, multithread = 12)


# For ITS:
XITS$primersummary <- FUNS$generate_primer_summary(XITS, "fnFs.filtN", "fnRs.filtN", multithread = 12)

# For 16S:
X16S$primersummary <- FUNS$generate_primer_summary(X16S, "fnFs.filtN", "fnRs.filtN", multithread = 12)
FW.fw.Forward FW.fw.Complement FW.fw.Reverse FW.fw.RevComp FW.rv.Forward FW.rv.Complement FW.rv.Reverse FW.rv.RevComp RV.fw.Forward RV.fw.Complement RV.fw.Reverse RV.fw.RevComp RV.rv.Forward RV.rv.Complement RV.rv.Reverse RV.rv.RevComp
A1T 77813 0 0 0 0 0 0 13350 0 0 0 19358 75396 0 0 0
A2T 73765 0 0 0 0 0 0 15852 0 0 0 23964 71090 0 0 0
A2T1 80687 0 0 0 0 0 0 8948 0 0 0 13140 77918 0 0 0
A3T 79093 0 0 0 0 0 0 7598 0 0 0 12367 76578 0 0 0
A3T1 66472 0 0 0 0 0 0 9514 0 0 0 14092 64164 0 0 0
A4T 78599 0 0 0 0 0 0 5839 0 0 0 8362 75914 0 0 0
A4T1 92029 0 0 0 0 0 0 4872 0 0 0 6970 89315 0 0 0
A6T 77947 0 0 0 0 0 0 3989 0 0 0 6064 75705 0 0 0
A6T1 93955 0 0 0 0 0 0 3771 0 0 0 5585 90905 0 0 0
A7T 86119 0 0 0 0 0 0 9994 0 0 0 13895 83486 0 0 0
A8T 77877 0 0 0 0 0 0 6869 0 0 0 10560 75306 0 0 0
A9T 79231 0 0 0 1 0 0 8442 1 0 0 12425 76528 0 0 0
C1T 64535 0 0 0 0 0 0 1427 0 0 0 2159 62013 0 0 0
C2T 55851 0 0 0 0 0 0 1210 0 0 0 2572 53699 0 0 0
C3T 82673 0 0 0 0 0 0 1393 0 0 0 3698 79537 0 0 0
C5T 78502 0 0 0 0 0 0 391 0 0 0 669 75487 0 0 0
C6T 67432 0 0 0 0 0 0 467 0 0 0 943 65158 0 0 0
F1T 74167 0 0 0 0 0 0 2794 0 0 0 5249 71791 0 0 0
F3T 73513 0 0 0 0 0 0 4541 0 0 0 8099 71079 0 0 0
FU1T 84775 0 0 0 0 0 0 2505 0 0 0 4442 82113 0 0 0
FU2T 67011 0 0 0 0 0 0 2047 0 0 0 4614 64837 0 0 0
FU3T 78759 0 0 0 0 0 0 6105 0 0 0 9842 76029 0 0 0
FU3T1 76464 0 0 0 0 0 0 4650 0 0 0 8789 74295 0 0 0
HT 71981 0 0 0 0 0 0 2012 0 0 0 3122 69441 0 0 0
mockITS 82685 0 0 1 1 0 0 17954 1 0 0 33689 79930 0 0 0
mockITS1 78942 0 0 1 1 0 0 16111 0 0 0 31473 76322 0 0 0
O10T 90383 0 0 0 0 0 0 5890 0 0 0 11378 87120 0 0 0
O11T 49393 0 0 0 0 0 0 1248 0 0 0 2028 47526 0 0 0
O12T 82557 0 0 0 0 0 0 3095 0 0 0 5599 79568 0 0 0
O2T 86638 0 0 0 0 0 0 6116 0 0 0 12092 83795 0 0 0
O3T 89327 0 0 0 0 0 0 10358 0 0 0 18191 85950 0 0 0
O4T 78314 0 0 0 0 0 0 3269 0 0 0 5680 75476 0 0 0
O5T 82614 0 0 0 0 0 0 7039 0 0 0 11972 79701 0 0 0
O6T 77044 0 0 0 0 0 0 3715 0 0 0 6935 74449 0 0 0
O8T 80788 0 0 0 0 0 0 5730 0 0 0 11518 79342 0 0 0
O9T 70605 0 0 0 0 0 0 4127 0 0 0 6807 67977 0 0 0
O9T1 86912 0 0 1 1 0 0 5731 1 0 0 10308 83948 0 0 0
T10T 51784 0 0 0 0 0 0 8305 0 0 0 14446 49815 0 0 0
T10T1 92874 0 0 0 1 0 0 14787 1 0 0 25950 89680 0 0 0
T11T 74391 0 0 0 0 0 0 1949 0 0 0 4664 71800 0 0 0
T12T 86740 0 0 0 0 0 0 22301 0 0 0 32543 83731 0 0 0
T12T1 92359 0 0 0 0 0 0 13065 0 0 0 21035 89499 0 0 0
T5T 80626 0 0 0 0 0 0 6055 0 0 0 9938 77903 0 0 0
T9T 84065 0 0 0 0 0 0 2688 0 0 0 3891 81492 0 0 0
T9T1 91939 0 0 0 0 0 0 5586 0 0 0 9799 88958 0 0 0
U3T 64310 0 0 0 1 0 0 1325 1 0 0 2716 62055 0 0 0
FW.fw.Forward FW.fw.Complement FW.fw.Reverse FW.fw.RevComp FW.rv.Forward FW.rv.Complement FW.rv.Reverse FW.rv.RevComp RV.fw.Forward RV.fw.Complement RV.fw.Reverse RV.fw.RevComp RV.rv.Forward RV.rv.Complement RV.rv.Reverse RV.rv.RevComp
A1T 63653 0 0 0 0 0 0 4 0 0 0 16 62935 0 0 0
A2T 64125 0 0 0 1 0 0 10 1 0 0 13 63342 0 0 0
A2T1 83939 0 0 0 0 0 0 0 0 0 0 2 82776 0 0 0
A3T 64353 0 0 0 1 0 0 1 1 0 0 6 63473 0 0 0
A3T1 64896 0 0 0 3 0 0 2 3 0 0 7 64144 0 0 0
A4T 47381 0 0 0 0 0 0 1 0 0 0 3 46953 0 0 0
A4T1 60805 0 0 0 1 0 0 5 1 0 0 19 61391 0 0 0
A5T 78884 0 0 0 2 0 0 203 2 0 0 1569 81228 0 0 0
A6T 96701 0 0 0 0 0 0 2 0 0 0 6 96089 0 0 0
A6T1 46396 0 0 0 0 0 0 353 0 0 0 1378 46903 0 0 0
A7T 61965 0 0 0 0 0 0 6 0 0 0 17 61111 0 0 0
A8T 65411 0 0 0 3 0 0 12 1 0 0 83 64690 0 0 0
A9T 67579 0 0 0 3 0 0 3 4 0 0 17 66986 0 0 0
C1T 64170 0 0 0 1 0 0 2 0 0 0 6 63546 0 0 0
C2T 52220 0 0 0 0 0 0 1 0 0 0 3 51441 0 0 0
C3T 60544 0 0 0 1 0 0 1 1 0 0 7 59731 0 0 0
C4T 64441 0 0 0 1 0 0 0 0 0 0 6 62853 0 0 0
C5T 56084 0 0 0 2 0 0 3 2 0 0 6 55094 0 0 0
C6T 60380 0 0 0 0 0 0 3 0 0 0 4 59438 0 0 0
F1T 86897 0 0 0 0 0 0 0 0 0 0 10 85657 0 0 0
F3T 62047 0 0 0 0 0 0 2 0 0 0 3 61334 0 0 0
FU1T 58131 0 0 0 1 0 0 3 1 0 0 9 57621 0 0 0
FU2T 55921 0 0 0 2 0 0 3 2 0 0 8 55135 0 0 0
FU3T 56426 0 0 0 2 0 0 2 2 0 0 3 55624 0 0 0
FU3T1 63504 0 0 0 3 0 0 2 2 0 0 7 62943 0 0 0
HT 53937 0 0 0 0 0 0 0 0 0 0 0 53360 0 0 0
mock16S 52943 0 0 0 4 0 0 0 4 0 0 3 51888 0 0 0
mock16S1 44966 0 0 0 3 0 0 0 3 0 0 5 44047 0 0 0
O10T 62655 0 0 0 0 0 0 0 0 0 0 4 61474 0 0 0
O11T 61037 0 0 0 2 0 0 4 2 0 0 4 59977 0 0 0
O12T 56698 0 0 0 0 0 0 1 0 0 0 6 55962 0 0 0
O1T 57881 0 0 0 1 0 0 1 1 0 0 5 56938 0 0 0
O2T 46587 0 0 0 0 0 0 1 0 0 0 3 45578 0 0 0
O3T 63209 0 0 0 0 0 0 3 0 0 0 10 62312 0 0 0
O4T 66018 0 0 0 1 0 0 1 2 0 0 1 65185 0 0 0
O5T 66160 0 0 0 1 0 0 0 1 0 0 2 65345 0 0 0
O6T 52017 0 0 0 0 0 0 1 0 0 0 2 51475 0 0 0
O7T 81704 0 0 0 0 0 0 3 0 0 0 17 81564 0 0 0
O8T 59378 0 0 0 1 0 0 0 1 0 0 5 58426 0 0 0
O9T 55057 0 0 0 1 0 0 1 1 0 0 4 54122 0 0 0
O9T1 55066 0 0 0 0 0 0 0 0 0 0 0 54575 0 0 0
T10T 62680 0 0 0 1 0 0 0 1 0 0 1 62262 0 0 0
T10T1 67617 0 0 0 1 0 0 0 1 0 0 10 67569 0 0 0
T11T 50304 0 0 0 0 0 0 2 0 0 0 6 49669 0 0 0
T11T1 58175 0 0 0 1 0 0 0 1 0 0 1 56927 0 0 0
T12T 55569 0 0 0 0 0 0 1 0 0 0 1 54907 0 0 0
T12T1 85881 0 0 0 0 0 0 1 0 0 0 5 85123 0 0 0
T1T 78272 0 0 0 0 0 0 0 0 0 0 1 77752 0 0 0
T3T 90936 0 0 0 0 0 0 424 0 0 0 1336 92033 0 0 0
T5T 64045 0 0 0 0 0 0 1 0 0 0 4 63362 0 0 0
T9T 60004 0 0 0 0 0 0 6 0 0 0 35 60612 0 0 0
T9T1 57229 0 0 0 0 0 0 0 0 0 0 3 56654 0 0 0
U3T 55430 0 0 0 0 0 0 0 0 0 0 1 54317 0 0 0

In this table, you can sporadically spot the odd badly oriented primer, but it’s such a small fraction of the dataset (e.g. in A9T) , that I would not bother trying to filter those out. Readthrough however, seems to be pretty common. Here I calculate the percentage readthrough for all libraries:

# For ITS
XITS$proportion_revcomp_fw <- XITS$primersummary[,8]/XITS$primersummary[,1]
XITS$proportion_revcomp_rv <- XITS$primersummary[,12]/XITS$primersummary[,13]

# For 16S
X16S$proportion_revcomp_fw <- X16S$primersummary[,8]/X16S$primersummary[,1]
X16S$proportion_revcomp_rv <- X16S$primersummary[,12]/X16S$primersummary[,13]
plot(XITS$proportion_revcomp_fw,XITS$proportion_revcomp_rv,log="xy",col="white",cex=0.75, main = "ITS: proportion of reads with readthrough of the entire primer")
text(XITS$proportion_revcomp_fw,XITS$proportion_revcomp_rv,labels=sample.names,cex=0.75)

plot(X16S$proportion_revcomp_fw + 1e-06 ,X16S$proportion_revcomp_rv + 1e-06 , log="xy",col="white",cex=0.75, main = "16S: proportion of reads with readthrough of the entire primer")
text(X16S$proportion_revcomp_fw + 1e-06 ,X16S$proportion_revcomp_rv + 1e-06 , labels=sample.names,cex=0.75)

There is virtually no readthrough for the 16S samples, and when there is some, it’s correlated between forward and reverse. I had to add a small quantity to the 16S counts in order to correct for zeros.

For ITS, in the forward libraries, the proportion ranges from 0.5% to 25%. In the reverse libraries, it’s between 0.9% and 42%. The plot shows that forward and reverse are correlated, as expected.
Samples C5T and C6T are clearly outliers in this respect, with very few detected readthroughs. What we can tentatively conclude for ITS at this point:

Next step is cutadapt. I installed this on my local machine, and ran it via a system call:

cutadapt <- "/usr/bin/cutadapt" 
system2(cutadapt, args = "--version")

# Generate a subdirectory that will contain the cutadapt trimmed files
XITS$path.cut <- file.path(XITS$path, "cutadapt"); if(!dir.exists(XITS$path.cut)) dir.create(XITS$path.cut)
X16S$path.cut <- file.path(X16S$path, "cutadapt"); if(!dir.exists(X16S$path.cut)) dir.create(X16S$path.cut)

XITS$fnFs.cut <- file.path(XITS$path.cut, basename(XITS$fnFs))
XITS$fnRs.cut <- file.path(XITS$path.cut, basename(XITS$fnRs))

X16S$fnFs.cut <- file.path(X16S$path.cut, basename(X16S$fnFs))
X16S$fnRs.cut <- file.path(X16S$path.cut, basename(X16S$fnRs))

XITS$FWD.RC <- dada2:::rc(XITS$FWD)
XITS$REV.RC <- dada2:::rc(XITS$REV)

X16S$FWD.RC <- dada2:::rc(X16S$FWD)
X16S$REV.RC <- dada2:::rc(X16S$REV)

# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
XITS$R1.flags <- paste("-g", XITS$FWD, "-a", XITS$REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
XITS$R2.flags <- paste("-G", XITS$REV, "-A", XITS$FWD.RC)

# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
X16S$R1.flags <- paste("-g", X16S$FWD, "-a", X16S$REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
X16S$R2.flags <- paste("-G", X16S$REV, "-A", X16S$FWD.RC)


# Run Cutadapt for ITS
for(i in seq_along(XITS$fnFs)) {
  system2("cutadapt", args = c(XITS$R1.flags, XITS$R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", XITS$fnFs.cut[i], "-p", XITS$fnRs.cut[i], # output files
                             XITS$fnFs.filtN[i], XITS$fnRs.filtN[i])) # input files
}


# Run Cutadapt for 16S
for(i in seq_along(X16S$fnFs)) {
  system2("cutadapt", args = c(X16S$R1.flags, X16S$R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", X16S$fnFs.cut[i], "-p", X16S$fnRs.cut[i], # output files
                             X16S$fnFs.filtN[i], X16S$fnRs.filtN[i])) # input files
}



# For ITS:
XITS$primersummary.2 <- FUNS$generate_primer_summary(XITS, "fnFs.cut", "fnRs.cut")
# For 16S:
X16S$primersummary.2 <- FUNS$generate_primer_summary(X16S, "fnFs.cut", "fnRs.cut")
FW.fw.Forward FW.fw.Complement FW.fw.Reverse FW.fw.RevComp FW.rv.Forward FW.rv.Complement FW.rv.Reverse FW.rv.RevComp RV.fw.Forward RV.fw.Complement RV.fw.Reverse RV.fw.RevComp RV.rv.Forward RV.rv.Complement RV.rv.Reverse RV.rv.RevComp
A1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A3T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A4T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A4T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A6T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A7T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A8T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A9T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
C1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU3T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
mockITS 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0
mockITS1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
O10T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O11T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O12T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O4T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O8T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O9T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O9T1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0
T10T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T10T1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
T11T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T12T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T12T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
U3T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
FW.fw.Forward FW.fw.Complement FW.fw.Reverse FW.fw.RevComp FW.rv.Forward FW.rv.Complement FW.rv.Reverse FW.rv.RevComp RV.fw.Forward RV.fw.Complement RV.fw.Reverse RV.fw.RevComp RV.rv.Forward RV.rv.Complement RV.rv.Reverse RV.rv.RevComp
A1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
A2T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A3T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
A3T1 0 0 0 0 3 0 0 0 3 0 0 0 0 0 0 0
A4T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A4T1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
A5T 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0
A6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A6T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A7T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A8T 0 0 0 0 3 0 0 0 1 0 0 0 0 0 0 0
A9T 0 0 0 0 3 0 0 0 4 0 0 0 0 0 0 0
C1T 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
C2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C3T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
C4T 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
C5T 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0
C6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU1T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
FU2T 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0
FU3T 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0
FU3T1 0 0 0 0 3 0 0 0 2 0 0 0 0 0 0 0
HT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
mock16S 0 0 0 0 4 0 0 0 4 0 0 0 0 0 0 0
mock16S1 0 0 0 0 3 0 0 0 3 0 0 0 0 0 0 0
O10T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O11T 0 0 0 0 2 0 0 0 2 0 0 0 0 0 0 0
O12T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O1T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
O2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O4T 0 0 0 0 1 0 0 0 2 0 0 0 0 0 0 0
O5T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
O6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O7T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O8T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
O9T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
O9T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T10T 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
T10T1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
T11T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T11T1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0
T12T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T12T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
U3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Now you can see the primers have gone, except for the sporadic badly oriented primer.

0.4 Step 4 : Trimmomatic

Trimmomatic has multiple important parameters:

This part has taken me quite some optimization. When we go too stringent, the reads end up too small. At first sight this is not a problem, but then afterwards when pairs have to get merged, they don’t have sufficient overlap anymore. So the parameters I end up using here, are stringent enough to generally exclude shitty base calls, but leave reads of sufficient length for merging. If afterwards it would turn out we get rid of samples C5T and U3T, we can crank up the stringency.

# Forward and reverse fastq filenames have the format:
XITS$cutFs <- sort(list.files(XITS$path.cut, pattern = "_1.fq.gz", full.names = TRUE))
XITS$cutRs <- sort(list.files(XITS$path.cut, pattern = "_2.fq.gz", full.names = TRUE))

dir.create(file.path(XITS$path.cut, "trim_paired"))
dir.create(file.path(XITS$path.cut, "trim_unpaired"))
XITS$trimFs_paired   <- file.path(XITS$path.cut, "trim_paired", basename(XITS$cutFs))
XITS$trimFs_unpaired <- file.path(XITS$path.cut, "trim_unpaired", basename(XITS$cutFs))
XITS$trimRs_paired   <- file.path(XITS$path.cut, "trim_paired", basename(XITS$cutRs))
XITS$trimRs_unpaired <- file.path(XITS$path.cut, "trim_unpaired", basename(XITS$cutRs))

for(i in 1:length(sample.names))
{  
command <- paste(c("java -jar ~/Trimmomatic-0.39/trimmomatic-0.39.jar PE",
XITS$cutFs[i],
XITS$cutRs[i],
XITS$trimFs_paired[i],
XITS$trimFs_unpaired[i],
XITS$trimRs_paired[i],
XITS$trimRs_unpaired[i],
"LEADING:0 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:100"),collapse=" ")

system(command) # This is actually the call to execute the Trimmomatic call outlined above
}
# Forward and reverse fastq filenames have the format:
X16S$cutFs <- sort(list.files(X16S$path.cut, pattern = "_1.fq.gz", full.names = TRUE))
X16S$cutRs <- sort(list.files(X16S$path.cut, pattern = "_2.fq.gz", full.names = TRUE))

dir.create(file.path(X16S$path.cut, "trim_paired"))
dir.create(file.path(X16S$path.cut, "trim_unpaired"))
X16S$trimFs_paired   <- file.path(X16S$path.cut, "trim_paired", basename(X16S$cutFs))
X16S$trimFs_unpaired <- file.path(X16S$path.cut, "trim_unpaired", basename(X16S$cutFs))
X16S$trimRs_paired   <- file.path(X16S$path.cut, "trim_paired", basename(X16S$cutRs))
X16S$trimRs_unpaired <- file.path(X16S$path.cut, "trim_unpaired", basename(X16S$cutRs))

for(i in 1:length(sample.names))
{  
command <- paste(c("java -jar ~/Trimmomatic-0.39/trimmomatic-0.39.jar PE",
X16S$cutFs[i],
X16S$cutRs[i],
X16S$trimFs_paired[i],
X16S$trimFs_unpaired[i],
X16S$trimRs_paired[i],
X16S$trimRs_unpaired[i],
"LEADING:0 TRAILING:25 SLIDINGWINDOW:4:25 MINLEN:100"),collapse=" ")

system(command) # This is actually the call to execute the Trimmomatic call outlined above
}

Also here, you should go through all of them.

plotQualityProfile(c(XITS$trimFs_paired[5],XITS$trimRs_paired[5]))

plotQualityProfile(c(X16S$trimFs_paired[5],X16S$trimRs_paired[5]))

0.5 Step 5 : the real filterAndTrim

This is a last filtering step, not very necessary anymore after Trimmomatic, but we go for it anwyay. The most important additional trimming happens on the total expected errors per read. With a maximum of 0.5 expected errors per read, we should be on the safe side. However, in my experience you can go down much lower (like < 0.05 easily) if your trimmomatic is more stringent. However, you will not be able to merge a lot of reads in C5T and U3T for the ITS.

Notice we are only working with the reads that came out of Trimmomatic as intact pairs. The reads that got orphaned, are not taken along the rest of the analysis.

XITS$filtFs <- file.path(XITS$path.cut, "filtered", basename(XITS$cutFs))
XITS$filtRs <- file.path(XITS$path.cut, "filtered", basename(XITS$cutRs))

XITS$out <- filterAndTrim(XITS$trimFs_paired, XITS$filtFs, XITS$trimRs_paired, XITS$filtRs, 
                     maxN = 0, 
                     maxEE = c(0.5,0.5), 
                     truncQ = 2, 
                     minLen = 100, 
                     rm.phix = TRUE, 
                     matchIDs=T, 
                     compress = TRUE, 
                     multithread = 12) 


# Same for 16S

X16S$filtFs <- file.path(X16S$path.cut, "filtered", basename(X16S$cutFs))
X16S$filtRs <- file.path(X16S$path.cut, "filtered", basename(X16S$cutRs))

X16S$out <- filterAndTrim(X16S$trimFs_paired, X16S$filtFs, X16S$trimRs_paired, X16S$filtRs, 
                     maxN = 0, 
                     maxEE = c(0.5,0.5), 
                     truncQ = 2, 
                     minLen = 100, 
                     rm.phix = TRUE, 
                     matchIDs=T, 
                     compress = TRUE, 
                     multithread = 12) 

0.6 Step 6: The DADA2 functions

Detailed info about these functions can be found in the dada2 documentation. We start with learning the error profiles. nbases is set high enough so the entire dataset should be used for training.

XITS$errF <- learnErrors(XITS$filtFs, multithread = 12, nbases= 1e+10)
XITS$errR <- learnErrors(XITS$filtRs, multithread = 12, nbases= 1e+10)
X16S$errR <- learnErrors(X16S$filtFs, multithread = 12, nbases= 1e+10)
X16S$errR <- learnErrors(X16S$filtRs, multithread = 12, nbases= 1e+10)

We can plot the observed error profile for each transition/transversion and compare to theoretical values. I used these plots a lot for the optimization of the filtering parameters.

0.6.1 Error profiles for ITS, forward and reverse:

plotErrors(XITS$errF, nominalQ=TRUE)

plotErrors(XITS$errR, nominalQ=TRUE)

0.6.2 Error profiles for 16S, forward and reverse:

plotErrors(X16S$errR, nominalQ=TRUE)

plotErrors(X16S$errR, nominalQ=TRUE)

For 16S they don’t look super good. It does not seem to be a major concern (see here). We may want to look if additional trimming or filtering improves the error profiles, but I currently don’t consider it a priority.

Next we move on with dereplication, and the real dada function:

XITS$derepFs  <- derepFastq(XITS$filtFs, verbose=TRUE)
XITS$derepRs  <- derepFastq(XITS$filtRs, verbose=TRUE)
# Name the derep-class objects by the sample names
names(XITS$derepFs)  <- XITS$sample.names
names(XITS$derepRs) <- XITS$sample.names

X16S$derepFs <- derepFastq(X16S$filtFs, verbose=TRUE)
X16S$derepRs <- derepFastq(X16S$filtRs, verbose=TRUE)
names(X16S$derepFs) <- X16S$sample.names
names(X16S$derepRs) <- X16S$sample.names


XITS$dadaFs <- dada(XITS$derepFs, err=XITS$errF, multithread=12)
XITS$dadaRs <- dada(XITS$derepRs, err=XITS$errR, multithread=12)

X16S$dadaFs <- dada(X16S$derepFs, err=X16S$errR, multithread=12)
X16S$dadaRs <- dada(X16S$derepRs, err=X16S$errR, multithread=12)

XITS$mergers <- mergePairs(XITS$dadaFs, XITS$derepFs, XITS$dadaRs, XITS$derepRs, verbose=TRUE, maxMismatch=1)
XITS$seqtab  <- makeSequenceTable(XITS$mergers)
table(nchar(getSequences(XITS$seqtab)))
plot(table(nchar(getSequences(XITS$seqtab))))
XITS$seqtab.nochim <- removeBimeraDenovo(XITS$seqtab, method="consensus", multithread=16, verbose=TRUE)

X16S$mergers <- mergePairs(X16S$dadaFs, X16S$derepFs, X16S$dadaRs, X16S$derepRs, verbose=TRUE, maxMismatch=1)
X16S$seqtab  <- makeSequenceTable(X16S$mergers)
table(nchar(getSequences(X16S$seqtab)))
plot(table(nchar(getSequences(X16S$seqtab))))
X16S$seqtab.nochim <- removeBimeraDenovo(X16S$seqtab, method="consensus", multithread=16, verbose=TRUE)

A bimera is a two-parent chimera, in which the left side is made up of one parent sequence, and the right-side made up of a second parent sequence.

For ITS, we identified 142 bimeras out of 4954 input sequences. Nothing to worry about. For 16S, we identified 1987 bimeras out of 11820 input sequences, for 16S. That’s almost 17%, which is a lot. Not sure if we can do a lot about it. It’s something that typically happens during the PCR phase. What would worry me more is that they are perhaps false positive bimera’s. Anyhow, for now they are just removed for the remainder of the analysis, which is certainly the safest option.

Now reads have been filtered, errors have been learned, doubles have been dereplicated, dada has been executed, reads have been merged and chimeras have been removed. Let’s summarise this in a table.

XITS$track <- cbind(XITS$out, sapply(XITS$dadaFs, FUNS$getN), sapply(XITS$dadaRs, FUNS$getN), sapply(XITS$mergers, FUNS$getN), rowSums(XITS$seqtab.nochim))
X16S$track <- cbind(X16S$out, sapply(X16S$dadaFs, FUNS$getN), sapply(X16S$dadaRs, FUNS$getN), sapply(X16S$mergers, FUNS$getN), rowSums(X16S$seqtab.nochim))

# If processing a single sample, remove the sapply calls: e.g. replace sapply(dadaFs, FUNS$getN) with FUNS$getN(dadaFs)
colnames(XITS$track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
colnames(X16S$track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")

rownames(XITS$track) <- XITS$sample.names
rownames(X16S$track) <- X16S$sample.names

0.6.3 Summary of read numbers throughout filtering for ITS:

input filtered denoisedF denoisedR merged nonchim
A1T 70386 70266 69452 68912 67097 63584
A2T 68132 68007 67496 67035 63102 62928
A2T1 73934 73791 73668 73544 71463 71122
A3T 72770 72630 71946 71185 68587 68362
A3T1 61226 61112 60182 59156 55828 55528
A4T 72194 72074 71949 71832 71018 71018
A4T1 85347 85192 85052 85121 84877 84799
A6T 71979 71855 71767 71714 71222 71222
A6T1 86518 86349 86276 86079 82296 82272
A7T 78905 78756 77638 76699 73327 72872
A8T 71930 71730 71113 70852 67771 67438
A9T 72184 72073 71421 71182 69944 69824
C1T 59307 59174 58745 58253 57473 57273
C2T 49926 49832 49326 48675 47362 47362
C3T 72283 72125 71621 70740 69682 69664
C5T 58387 58066 57728 57554 57086 57077
C6T 59801 59697 59235 58777 57292 57249
F1T 68057 67931 67874 67752 65447 65220
F3T 66860 66721 66469 66008 65045 65022
FU1T 78449 78331 77705 77199 74796 74264
FU2T 61114 60997 60372 59555 57030 57023
FU3T 71972 71848 71141 70313 63472 63472
FU3T1 69865 69712 68952 68013 58476 58281
HT 64761 64632 64462 64355 63734 63725
mockITS 75521 75385 74118 75289 72236 67997
mockITS1 72539 72415 72249 72365 70824 66778
O10T 81955 81790 81205 80688 78492 78211
O11T 45587 45487 45045 44566 43558 43533
O12T 76520 76304 75707 75026 73465 73401
O2T 80001 79800 79079 78480 76224 76224
O3T 81145 80955 79963 79149 75292 75108
O4T 70441 70258 69832 69376 68470 68455
O5T 75464 75286 74609 73869 70933 70737
O6T 71320 71174 70793 70299 68382 68317
O8T 73888 73683 72160 70893 66184 65720
O9T 64053 63924 63357 62422 60155 59923
O9T1 79935 79775 79546 78990 76642 76642
T10T 47337 47217 46742 46071 44448 44448
T10T1 84002 83822 83568 83065 80652 80583
T11T 67118 66983 66624 66030 63206 63206
T12T 78917 78778 78440 77814 74555 74555
T12T1 84568 84376 84200 83901 80964 80964
T5T 73430 73269 73086 72866 71579 70170
T9T 76561 76388 76319 76258 75535 75531
T9T1 84016 83742 83596 83384 78524 78406
U3T 50804 50605 50185 49728 48543 48543

0.6.4 Summary of read numbers throughout filtering For 16S

input filtered denoisedF denoisedR merged nonchim
A1T 58298 58088 50174 54312 26564 25958
A2T 59045 58852 51300 56418 27990 27025
A2T1 77702 77461 75414 75608 58887 57620
A3T 59004 58822 51207 55144 25989 25467
A3T1 59571 59368 50065 55048 23452 22995
A4T 43614 43430 41868 41885 30600 29690
A4T1 57779 57598 55834 55053 44921 44571
A5T 75612 75312 69389 68726 52896 42132
A6T 89876 89614 86524 86897 67940 67259
A6T1 43847 43721 41482 41083 26800 26678
A7T 56757 56580 48268 52726 24378 23725
A8T 60761 60592 53770 57797 33529 32979
A9T 62742 62573 54932 59880 33967 32687
C1T 59037 58856 49470 54285 22477 21698
C2T 47876 47709 39898 43217 19114 17974
C3T 55986 55825 46618 51598 22893 22073
C4T 55637 55369 47507 52512 17220 17012
C5T 51019 50832 43880 46847 24104 23166
C6T 55322 55130 46604 51304 22733 21824
F1T 79476 79198 76276 77110 55924 54773
F3T 57117 56920 51240 52693 28579 28209
FU1T 53282 53093 45593 50117 24051 23543
FU2T 51260 51130 42048 46627 17654 17481
FU3T 52126 51945 44034 47499 21735 21389
FU3T1 59302 59134 49675 54582 24739 24426
HT 49285 49127 42564 44980 24441 24161
mock16S 48115 47954 40647 42134 30603 19789
mock16S1 41017 40899 34302 36737 25857 16894
O10T 57144 56959 49240 53676 28606 26030
O11T 56200 56040 49131 52007 26907 25675
O12T 52273 52100 43421 47156 20621 20139
O1T 53105 52915 42512 45399 17293 16958
O2T 40212 40013 31356 34343 10743 10684
O3T 57702 57484 47716 50322 18462 18249
O4T 60708 60516 54144 57072 30096 29179
O5T 61312 61116 52113 56020 27024 26500
O6T 47732 47578 40517 42785 20708 20440
O7T 76867 76642 74805 74893 60205 59498
O8T 54225 54066 43183 47091 17032 16773
O9T 50287 50101 43630 45059 20049 19831
O9T1 51039 50853 47589 47010 24532 24275
T10T 57982 57827 50041 54640 26778 26176
T10T1 63223 63048 58180 59669 35932 35390
T11T 46092 45960 39493 42511 15092 14541
T11T1 52358 52170 42624 45825 17902 17707
T12T 51667 51514 46382 47994 25378 24604
T12T1 79278 79028 76560 76755 47963 46678
T1T 72564 72347 69379 70421 56332 55152
T3T 86046 85830 64480 63764 50385 50312
T5T 59064 58880 50515 54745 26744 26028
T9T 54808 54582 52519 52930 37135 36792
T9T1 53110 52936 50312 49845 32607 32058
U3T 49627 49455 41747 45740 18741 18603

Notice that this workflow has been optimized to retain as many read pairs as possible. This has been a deliberate choice in order to prevent biases. However, we may want to repeat some things with more stringent criteria. This depends on whether we decide to keep all samples on board.

0.7 Step 7 : Mock communities

We have observed that our mock communities dominate the first dimension of MDS (not shown), which is expected. First, we take a look at sequences that are unique to our mock communities:

# Subset seqtab.nochim to only contain the mocks
XITS$mocks <- XITS$seqtab.nochim[grepl("mock",rownames(XITS$seqtab.nochim)),]
X16S$mocks <- X16S$seqtab.nochim[grepl("mock",rownames(X16S$seqtab.nochim)),]

# Check which sequences get at least one count in either mock 1 or mock 2: 
unname(which(XITS$mocks[1,] > 0 | XITS$mocks[2,] > 0))
unname(which(X16S$mocks[1,] > 0 | X16S$mocks[2,] > 0))

# Retrieve the sequences that correspond to these ASV's
XITS$mock_sequences <- colnames(XITS$mocks)[unname(which(XITS$mocks[1,] > 0 | XITS$mocks[2,] > 0))]
X16S$mock_sequences <- colnames(X16S$mocks)[unname(which(X16S$mocks[1,] > 0 | X16S$mocks[2,] > 0))]

# You can use these to check whether you find all taxons that you expect in your mocks, and whether you find something that you did not intend to include. 
# Next, you can check how many times you find sequences from your mock in other samples: 
XITS$mocks_table           <- XITS$seqtab.nochim[,unname(which(XITS$mocks[1,] > 0 | XITS$mocks[2,] > 0))]
colnames(XITS$mocks_table) <- paste ("sequence",1:ncol(XITS$mocks_table))

X16S$mocks_table           <- X16S$seqtab.nochim[,unname(which(X16S$mocks[1,] > 0 | X16S$mocks[2,] > 0))]
colnames(X16S$mocks_table) <- paste ("sequence",1:ncol(X16S$mocks_table))

Check the taxonomy of the mock samples. Notice that this is not the most efficient way of doing this, given that it takes a long time to load the references in RAM, and we will repeat this afterwards for the other samples.

X16S$mocks_taxa <- assignTaxonomy(X16S$mock_sequences,"/home/rik/fukushima/metabarcoding/silva_nr99_v138.1_wSpecies_train_set.fa.gz",verbose=T,multithread=10)
XITS$mocks_taxa <- assignTaxonomy(XITS$mock_sequences,"/home/rik/fukushima/metabarcoding/sh_general_release_dynamic_s_all_04.04.2024.fasta",verbose=T,multithread=10)

To be checked: for ITS, mock sequences 1 and sequence 6 occur in non-mock samples.

0.7.1 ITS mock sequences:

sequence 1 sequence 2 sequence 3 sequence 4 sequence 5 sequence 6 sequence 7 sequence 8 sequence 9 sequence 10 sequence 11 sequence 12 sequence 13 sequence 14 sequence 15 sequence 16 sequence 17 sequence 18
A1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2T1 571 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A3T 0 0 0 0 0 23 0 0 0 0 0 0 0 0 0 0 0 0
A3T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A4T 0 0 0 0 0 325 0 0 0 0 0 0 0 0 0 0 0 0
A4T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A6T 1790 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A6T1 1362 0 0 0 0 14 0 0 0 0 0 0 0 0 0 0 0 0
A7T 0 0 0 0 0 210 0 0 0 0 0 0 0 0 0 0 0 0
A8T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A9T 0 0 0 0 0 13 0 0 0 0 0 0 0 0 0 0 0 0
C1T 0 0 0 0 0 7 0 0 0 0 0 0 0 0 0 0 0 0
C2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C3T 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 0 0 0
C5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F1T 7579 0 0 0 0 150 0 0 0 0 0 0 0 0 0 0 0 0
F3T 15 0 0 0 0 21 0 0 0 0 0 0 0 0 0 0 0 0
FU1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU2T 0 0 0 0 0 8 0 0 0 0 0 0 0 0 0 0 0 0
FU3T 0 0 0 0 0 12 0 0 0 0 0 0 0 0 0 0 0 0
FU3T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
mockITS 8684 12946 9141 6795 5630 5307 5402 5717 2545 2128 1603 1691 0 212 0 113 45 38
mockITS1 8263 13451 8937 6899 5516 4527 5622 4607 2574 1996 1614 1506 858 216 125 0 33 34
O10T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O11T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O12T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O4T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O6T 0 0 0 0 0 30 0 0 0 0 0 0 0 0 0 0 0 0
O8T 0 0 0 0 0 69 0 0 0 0 0 0 0 0 0 0 0 0
O9T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O9T1 0 0 0 0 0 126 0 0 0 0 0 0 0 0 0 0 0 0
T10T 2889 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T10T1 14482 0 0 0 0 149 0 0 0 0 0 0 0 0 0 0 0 0
T11T 1906 0 0 0 0 33 0 0 0 0 0 0 0 0 0 0 0 0
T12T 8 0 0 0 0 23 0 0 0 0 0 0 0 0 0 0 0 0
T12T1 0 0 0 0 0 44 0 0 0 0 0 0 0 0 0 0 0 0
T5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T 704 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T1 3001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
U3T 0 0 0 0 0 14 0 0 0 0 0 0 0 0 0 0 0 0

0.7.2 ITS mock taxonomy:

Don’t forget to scroll to the right!
Kingdom Phylum Class Order Family Genus Species
CGCACCTTGCGCTCCTCGGTGTTCCGAGGAGCATGCCTGTTTGAGTGTCAGTAAATTCTCAACCCCTCTCGATTTGCTTCGAGCGGGTGCTTGGATGTTGGGGGCTGCCGGAGACACTGGATTCGTCCAGGACTCGGGCTCCTCTTAAATGAATCGGCTCGCGGTCGACTTTCGACTTTGCATGACAAGGCCTTTGGCGTGATAATGATCGCCGTTCGCCGAAGTGCAAGACAGAATGGTCCCGTGCCTCTAATGCGTCGACGCCTCTAGGGGCGTCTTCCTCATTGACGTTTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA k__Fungi p__Basidiomycota c__Agaricomycetes o__Boletales f__Suillaceae g__Suillus NA
CGCACATTGCGCCCCTTGGTATTCCAGGGGGCATGCCTGTTTGAGCGTCATTTCCTTCTCAAACATTCTGTTTGGTAGTGAGTGATACTCTTTGGAGTTAACTTGAAATTGCTGGCCTTTTCATTGGATGTTTTTTTTTTCCAAAGAGAGGTTTCTCTGCGTGCTTGAGGTATAATGCAAGTACGGTCGTTTTAGGTTTTACCAACTGCGGCTAATCTTTTTTATACTGAGCGTATTGGAACGTTATCGATAAGAAGAGAGCGTCTAGGCGAACAATGTTCTTAAAGTTTGACCTCAAATCAGGTAGGAGTACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Saccharomycetes o__Saccharomycetales f__Saccharomycetaceae g__Saccharomyces s__bayanus
CGCACATTGCGCCCCTTGGTATTCCATGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTATGCTTGGTGTTGGGTGTTTGTCTCGCCTTTTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACATCCAAAAGTCATTTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA k__Fungi p__Ascomycota NA NA NA NA NA
CGCACATTGCGCCCCCTGGTATTCCGGGGGGCATGCCTGTCCGAGCGTCATTGCTGCCCTCAAGCACGGCTTGTGTGTTGGGCCTCCGTCCCCCACCCGGGGGACGGGCCCGAAAGGCAGCGGCGGCACCGTGTCCGGTCCTCGAGCGTATGGGGCTTTGTCACCCGCTCTGTAGGCCCGGCCGGCGCCTGCCGACCCTCATCATCCTTTTTTCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Eurotiomycetes o__Eurotiales f__Aspergillaceae g__Penicillium s__adametzii
CGCACCTTGCGCTCCTTGGTATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAATCTTCAACTTACAGACCTTTGCGGGTTTGTAGGCTTGGACTTGGAGGCTTGTCGGCCGTGTTTCGGCCGGCTCCTCTTAAATGCATTAGCTTGATTCCTTGCGGATCGGCTCTCGGTGTGATAATGTCTACGCCGCGACCGTGAAGCGTTTTGGCGAGCTTCTAACCGTCCTCGTTTGGACAGCTTTATGACCTCTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA k__Fungi p__Basidiomycota c__Agaricomycetes o__Polyporales f__Ganodermataceae g__Ganoderma NA
CGCACATTGCGCCCCCTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTTCACCACTCAAGCCTCGCTTGGTATTGGGCAACGCGGTCCGCCGCGTGCCTCAAATCGACCGGCTGGGTCTTCTGTCCCCTAAGCGTTGTGGAAACTATTCGCTAAAGGGTGTTCGGGAGGCTACGCCGTAAAACAACCCCATTTCTAAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Dothideomycetes o__Cladosporiales f__Cladosporiaceae g__Cladosporium s__herbarum
CGCACCTTGCGCTCCTTGGTATTCCGAGGAGCATGCCTGTTTGAGTGTCATGAAATCTTCAACTTACAGACCTTTGCAGGTTTGTAGGCTTGGACTTGGAGGCTTGTCGGCCGTGTTTCGGCCGGCTCCTCTTAAATGCATTAGCTTGATTCCTTGCGGATCGGCTCTCGGTGTGATAATGTCTACGCCGCGACCGTGAAGCGTTTTGGCGAGCTTCTAACCGTCCTCGTTTGGACAGCTTTATGACCTCTGACCTCAAATCAGGTAGGACTACCCGCTGAACTTAA k__Fungi p__Basidiomycota c__Agaricomycetes o__Polyporales f__Ganodermataceae g__Ganoderma NA
CGCACATTGCGCCCTTTGGTATTCCAAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAACTTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Pleosporaceae g__Alternaria NA
CGCAACTTGCGCCCTTTGGTATTCCGAAGGGCATGCCTGTTTGAGAGTCATGAAAATCTCAATCCCTCGGGTTTTATTACCTGTTGGACTTGGATTTGGGTGTTTGCCGCGACCTGCAAAGGACGTCGGCTCGCCTTAAATGTGTTAGTGGGAAGGTGATTACCTGTCAGCCCGGCGTAATAAGTTTCGCTGGGCCTATGGGGTAGTCTTCGGCTTGCTGATAACAACCATCTCTTTTTGTTTGACCTCAAATCAGGTAGGGCTACCCGCTGAACTTAA k__Fungi p__Basidiomycota c__Tremellomycetes o__Tremellales f__Cryptococcaceae g__Cryptococcus s__neoformans
CGCACATTGCGCCCTCTGGTATTCCGGAGGGCATGCCTGTCCGAGCGTCATTGCTGCCCTCAAGCCCGGCTTGTGTGTTGGGCCCCGTCCCCCCCGCCGGGGGGACGGGCCCGAAAGGCAGCGGCGGCACCGCGTCCGGTCCTCGAGCGTATGGGGCTTCGTCACCCGCTCTAGTAGGCCCGGCCGGCGCCAGCCGACCCCCAACCTTTAATTATCTCAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA k__Fungi p__Basidiomycota c__Agaricomycetes o__Agaricales f__Amanitaceae g__Amanita s__caesareoides
CGCACATGGCGCCTTCCAGTATCCTGGGAGGCATGCCTGTCCGAGCGTCGTTTCAACCCTCGAGCCCCCGTGGCCCGGCGTTGGGGATCTGCCCAGGCAGTCCCCGAAAACCAGTGGCGGACCCGTTACAGGCCCTTCCCTTGCGTAGTAGCATTCGCCTCGCATCGGGAGCCGGCGGGCTTCCAGCCTCTAAACCCCCATCAAGTCCGCCCCGGCGGCACCAAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Sordariomycetes o__Glomerellales f__Plectosphaerellaceae g__Gibellulopsis NA
CGCACATTGCGCCCCTTGGTATTCCCTGGGGCATGCCTGTTCGAGCGTCATTTGTACCTTCAAGCTATGCTTGGTGTTGGGTGTTTGTCTCGCCTTTTGCGTGTAGACTCGCCTTAAAACAATTGGCAGCCGGCGTATTGATTTCGGAGCGCAGTACATCTCGCGCTTTGCACTCATAACGACGACATCCAAAAGTCATTTTTTACACTCTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Dothideomycetes o__Pleosporales f__Didymellaceae NA NA
CGCACATTGCGCCCTTTGGTATTCCCAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTGTCTCTAGCTTTGCTGGAGACTCGCCTTAAAGTAATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCACTCTCTATCAGCAAAGGTCTAGCATCCATTAAGCCTTTTTTTCAACTTTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA k__Fungi NA NA NA NA NA NA
CGCACATGGCGCCTTCCCGTATCCTGGGAGGCATGCCTGTCCGAGCGTCGTTTCAACCCTCGAGCCCCCGTGGCCCGGCGTTGGGGATCTGCCCAGGCAGTCCCCGAAAACCAGTGGCGGACCCGTTACAGGCCCTTCCCTTGCGTAGTAGCATTCGCCTCGCATCGGGAGCCGGCGGGCTTCCAGCCTCTAAACCCCCATCAAGTCCGCCCCGGCGGCACCAAGGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Sordariomycetes o__Glomerellales f__Plectosphaerellaceae g__Gibellulopsis NA
TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACTTAAGTTGTGTCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGAATCATCGAATAAGCGGAGGACTGTCTCTTATACACATCTCCGAGCCCACGAGACCACTCAATTCATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAA NA NA NA NA NA NA NA
TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACACCCTGATACAATCGTCGGCAGCGTCAGATGTGTATAAGAGACAGGTGAATCATCGAATAAGCGGAGGACTGTCTCTTATACACATCTCCGAGCCCACGAGACTGCCGGTCAGATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAA NA NA NA NA NA NA NA
CGCACATTGCGCCCGGTGGTATTCCGCCGGGCATGCCTGTTCGAGCGTCATTATAACCACTCAAGCCTGGCTTGGTATTGGGGTACGCGGTCCCGCGGCTCCTAAAATCAGTGGCGGTGCCGGTGGGCTCTAAGCGTAGTACATCTCCTCGCTATAGGGTCCCTCTGGTTGCTTGCCAAAACCCCCCATTTTTACAGGTTGACCTCGGATCAGGTAGGGATACCCGCTGAACTTAA NA NA NA NA NA NA NA
CGCACATTGCGCCCGCTGGTATTCCGGCGGGCATGCCTGTTCGAGCGTCATTTCAACCCTCAAGCCCTCGGGTTTGGTGTTGGGGATCGGCTCTGCCTTCTGGCGGTGCCGCCCCCGAAATACATTGGCGGTCTCGCTGCAGCCTCCATTGCGTAGTAGCTAACACCTCGCAACTGGAACGCGGCGCGGCCATGCCGTAAAACCCCAACTTCTGAATGTTGACCTCGGATCAGGTAGGAATACCCGCTGAACTTAA k__Fungi p__Ascomycota c__Sordariomycetes o__Hypocreales f__Nectriaceae g__Fusarium s__tricinctum

0.7.3 16S mock sequences:

sequence 1 sequence 2 sequence 3 sequence 4 sequence 5 sequence 6 sequence 7 sequence 8 sequence 9 sequence 10 sequence 11 sequence 12 sequence 13 sequence 14 sequence 15 sequence 16
A1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A2T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A3T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A4T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A4T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A6T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A7T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A8T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
A9T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C4T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FU3T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
HT 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
mock16S 3409 3034 3050 2741 1935 1432 649 586 585 556 545 477 434 313 41 2
mock16S1 2896 2586 2563 2443 1381 1264 648 497 494 481 443 471 414 313 0 0
O10T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O11T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O12T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O2T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O4T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O6T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O7T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O8T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O9T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
O9T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T10T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T10T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T11T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T11T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T12T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T12T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T1T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T5T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
T9T1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
U3T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0.7.4 16S mock taxonomy:

Don’t forget to scroll to the right!

Kingdom Phylum Class Order Family Genus Species
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGTGTTGTGGTTAATAACCGCAGCAATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella enterica
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia-Shigella coli
TAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGGGAAGAACAAGTACCGTTCGAATAGGGCGGTACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA Bacteria Firmicutes Bacilli Bacillales Bacillaceae Bacillus halotolerans
TAGGGAATCTTCCACAATGGGCGCAAGCCTGATGGAGCAACACCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAGCTCTGTTGTTAAAGAAGAACACGTATGAGAGTAACTGTTCATACGTTGACGGTATTTAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGAGAGTGCAGGCGGTTTTCTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACA Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Limosilactobacillus NA
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGCAGTAAGTTAATACCTTGCTGTTTTGACGTTACCAACAGAATAAGCACCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTCAGCAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCCAAAACTACTGAGCTAGAGTACGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTGATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas aeruginosa
TAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAACTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACA Bacteria Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus aureus
TAGGGAATCTTCCACAATGGGCGCAAGCCTGATGGAGCAACCCCGCGTGAGTGAAGAAGGGTTTCGGCTCGTAAAGCTCTGTTGTTAAAGAAGAACACGTATGAGAGTAACTGTTCATACGTTGACGGTATTTAACCAGAAAGTCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGATTTATTGGGCGTAAAGAGAGTGCAGGCGGTTTTCTAAGTCTGATGTGAAAGCCTTCGGCTTAACCGGAGAAGTGCATCGGAAACTGGATAACTTGAGTGCAGAAGAGGGTAGTGGAACTCCATGTGTAGCGGTGGAATGCGTAGATATATGGAAGAACACCAGTGGCGAAGGCGGCTACCTGGTCTGCAACTGACGCTGAGACTCGAAAGCATGGGTAGCGAACA Bacteria Firmicutes Bacilli Lactobacillales Lactobacillaceae Limosilactobacillus NA
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGTTGTAGATTAATACTCTGCAATTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCAAAACTGACTGACTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTAATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas azotoformans
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGTGGGGAGGAAGGGAGTAAAGTTAATACCTTTGCTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Escherichia-Shigella NA
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGTGTTGCGGTTAATAACCGCAGCAATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTCTGTCAAGTCGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAACTGGCAGGCTTGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Enterobacteriaceae Salmonella enterica
TAGGGAATCTTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGTATGAAGAAGGTTTTCGGATCGTAAAGTACTGTTGTTAGAGAAGAACAAGGATAAGAGTAACTGCTTGTCCCTTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGCGCGCAGGCGGTCTTTTAAGTCTGATGTGAAAGCCCCCGGCTTAACCGGGGAGGGTCATTGGAAACTGGAAGACTGGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGATATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGCGCGAAAGCGTGGGGAGCAAACA Bacteria Firmicutes Bacilli Lactobacillales Listeriaceae Listeria monocytogenes
TAGGGAATATTCCGCAATGGACGAAAGTCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTTTTCGGATCGTAAAGCTCTGTTGTTAGGGAAGAACAAGTACCGTTCGAATAGGGCGGTACCTTGACGGTACCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGAATTATTGGGCGTAAAGGGCTCGCAGGCGGTTCCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGGAACTTGAGTGCAGAAGAGGAGAGTGGAATTCCACGTGTAGCGGTGAAATGCGTAGAGATGTGGAGGAACACCAGTGGCGAAGGCGACTCTCTGGTCTGTAACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACA Bacteria Firmicutes Bacilli Bacillales Bacillaceae Bacillus intestinalis
TAGGGAATCTTCGGCAATGGACGAAAGTCTGACCGAGCAACGCCGCGTGAGTGAAGAAGGTTTTCGGATCGTAAAACTCTGTTGTTAGAGAAGAACAAGGACGTTAGTAACTGAACGTCCCCTGACGGTATCTAACCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGGCGGTTTCTTAAGTCTGATGTGAAAGCCCCCGGCTCAACCGGGGAGGGTCATTGGAAACTGGGAGACTTGAGTGCAGAAGAGGAGAGTGGAATTCCATGTGTAGCGGTGAAATGCGTAGATATATGGAGGAACACCAGTGGCGAAGGCGGCTCTCTGGTCTGTAACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACA Bacteria Firmicutes Bacilli Lactobacillales Enterococcaceae Enterococcus faecalis
TAGGGAATCTTCCGCAATGGGCGAAAGCCTGACGGAGCAACGCCGCGTGAGTGATGAAGGTCTTCGGATCGTAAAGCTCTGTTATTAGGGAAGAACATATGTGTAAGTAACTGTGCACATCTTGACGGTACCTAATCAGAAAGCCACGGCTAACTACGTGCCAGCAGCCGCGGTAATACGTAGGTGGCAAGCGTTATCCGGAATTATTGGGCGTAAAGCGCGCGTAGGCGGTTTTTTAAGTCTGATGTGAAAGCCCACGGCTCAACCGTGGAGGGTCATTGGAAACTGGAAAACTTGAGTGCAGAAGAGGAAAGTGGAATTCCATGTGTAGCGGTGAAATGCGCAGAGATATGGAGGAACACCAGTGGCGAAGGCGACTTTCTGGTCTGTAACTGACGCTGATGTGCGAAAGCGTGGGGATCAAACA Bacteria Firmicutes Bacilli Staphylococcales Staphylococcaceae Staphylococcus aureus
TGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTGTGAAGAAGGCCTTAGGGTTGTAAAGCACTTTCAGCGAGGAGGAAGGGTTCAGTGTTAATAGCACTGTGCATTGACGTTACTCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGAGCTTAACTTGGGAACTGCATTTGAAACTGGCAAGCTAGAGTCTTGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACAAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Enterobacterales Yersiniaceae Serratia plymuthica
TGGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCCATGCCGCGTGTGTGAAGAAGGTCTTCGGATTGTAAAGCACTTTAAGTTGGGAGGAAGGGTTGTAGATTAATACTCTGCAATTTTGACGTTACCGACAGAATAAGCACCGGCTAACTCTGTGCCAGCAGCCGCGGTAATACAGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCGCGTAGGTGGTTTGTTAAGTTGGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTCGAAAACTGACTGACTAGAGTATGGTAGAGGGTGGTGGAATTTCCTGTGTAGCGGTGAAATGCGTAGATATAGGAAGGAACACCAGTGGCGAAGGCGACCACCTGGACTAATACTGACACTGAGGTGCGAAAGCGTGGGGAGCAAACA Bacteria Proteobacteria Gammaproteobacteria Pseudomonadales Pseudomonadaceae Pseudomonas NA

0.8 Step 8 : some cleaning and reducing of the data:

0.8.1 Not so important for now: some filtering in ITS

We have found 4812 unique sequences, which is an awful lot. Many of them look alike, which I presume is mostly due to evolutionary relationships, but which is sometimes also due presence of some unrecognized bits of primers. I will cluster sequences using 5-mer distances, and if sequences look extremely similar, I may merge some.

Furthermore, some other unexpected sequences are present, which we filter out as well.

One thing I noticed, is that on rare occasions, sequences are actually virtually the same, except for a primer part that has not been recognized because its not similar enough to the primer sequences we gave for filtering. I inferred this from 5-mer distances (see below), and just want to show one example as an illustration here:

seqs4 <- DNAStringSet(x=colnames(XITS$seqtab.nochim)[c(1,2563,2611,3097)])
aln4  <- msa(seqs4)
## use default substitution matrix

For now I don’t manage to output the alignment. Don’t bother.

#knitr::raw_latex(msaPrettyPrint(x= aln4 , output="asis"))

knitr::pandoc_to()

[1] “html”

#knitr::raw_latex(msaPrettyPrint(x= aln4 , output="asis"))

knitr::raw_latex("\\emph{some text}")
#knitr::asis_output(msaPrettyPrint(x= aln4 , output="asis"))

If we look how often these are counted, we observe that in this case, there clearly is a main variant (variant 1). Variant 2 is just a SNP away from variant 1, which may well reflect a natural situation. It occurs in only one sample (A4T) and given its close proximity to variant 1 and its low frequency compared to the main variant, it can easily be “collapsed” with variant 1.

Variants 3 and 4 also occur in small numbers in single samples, but their origin is less clear to me. If they are a sequencing artifact, it’s worrisome that they occur in 38 and 25 copies. If they would just be a natural variant of the primer binding region, there is less reason to worry and they can be collapsed with variant 1 as well. BUT THIS HAS TO BE CHECKED

unname(XITS$seqtab.nochim[,c(1,2563,2611,3097)])
variant 1 variant 2 variant 3 variant 4
A1T 0 0 0 0
A2T 19460 0 0 0
A2T1 24988 0 0 0
A3T 0 0 0 0
A3T1 0 0 0 0
A4T 21478 38 0 0
A4T1 33565 0 0 25
A6T 10394 0 0 0
A6T1 7064 0 0 0
A7T 0 0 0 0
A8T 11989 0 0 0
A9T 3314 0 0 0
C1T 0 0 0 0
C2T 0 0 0 0
C3T 0 0 0 0
C5T 0 0 0 0
C6T 41 0 0 0
F1T 33 0 0 0
F3T 0 0 0 0
FU1T 0 0 0 0
FU2T 0 0 0 0
FU3T 0 0 0 0
FU3T1 0 0 0 0
HT 0 0 0 0
mockITS 0 0 0 0
mockITS1 0 0 0 0
O10T 0 0 0 0
O11T 0 0 0 0
O12T 0 0 0 0
O2T 0 0 0 0
O3T 0 0 0 0
O4T 0 0 0 0
O5T 0 0 0 0
O6T 0 0 0 0
O8T 0 0 0 0
O9T 2090 0 0 0
O9T1 2373 0 0 0
T10T 0 0 0 0
T10T1 0 0 0 0
T11T 6035 0 0 0
T12T 94 0 0 0
T12T1 30 0 0 0
T5T 35329 0 37 0
T9T 0 0 0 0
T9T1 0 0 0 0
U3T 0 0 0 0

Another way to look at it, is to check the last couple of nucleotides of our sequences. They should almost invariably read “CTTAA” for ITS, and “AAACA” for 16S

table(unlist(lapply(colnames(XITS$seqtab.nochim),FUNS$substrRight,5)))
## 
## AAAAA AGGCT CCTAA CTTAA CTTAT GAGGA GGAGA TTAAG TTTAA 
##    17     1     1  4602     1     3     2    24   161

Sequences ending in AAAAA are not fungal ITS. They are probably bacterial, and may just be carryover from other sequencing projects. Only 651 counts (17 sequences) in total. The other endings are either some unrecognized primers (GAGGA and GGAGA, 5 sequences, 218 total counts), something unknown (AGGCT) or some variations of the amplicon (TTTAA, CCTAA,TTAAG). Given that all exceptions are very small number of sequences, we will only proceed with the amplicons ending at CTTAA, TTTAA, CCTAA and TTAAG

unname(XITS$seqtab.nochim[,which(unlist(lapply(colnames(XITS$seqtab.nochim), FUNS$substrRight,5))=="AAAAA")])
XITS$X <- XITS$seqtab.nochim[,unlist(lapply(colnames(XITS$seqtab.nochim), FUNS$substrRight,5) %in% c("CTTAA", "TTTAA", "CCTAA", "TTAAG"))]

0.8.2 Removing mocks from data for further analysis:

We remove the mock samples, and also we discard those ASV’s that only belong to the mock samples:

XITS$X <- XITS$X[ ! rownames(XITS$X) %in% c("mockITS","mockITS1"),]
XITS$X <- XITS$X[,colSums(XITS$X) > 0]

# same for 16S

X16S$X <- X16S$seqtab.nochim
X16S$X <- X16S$X[ ! rownames(X16S$X) %in% c("mock16S","mock16S1"),]
X16S$X <- X16S$X[,colSums(X16S$X) > 0]

0.9 Step 9 : OTU clustering at 97.5% identity

NOTE TO SELF: we could (should?) have done this with CD-HIT as well. It’s probably less cumbersome.

We will now see if we can cluster variants based on k-mer distance. I do this in a two-step procedure for speed. First, we calculate the 5-mer distance between all sequences. I noticed that small differences, such as a SNP or the presence/absence of a primer was typically a 5-mer distance of > 0.001 but < 0.01. We cluster based on an initial k-mer distance of 5%. In a next step, we cluster within these clusters, based on real alignments, going to 97.5% sequence identity. This is our final cut-off for OTU clustering.

XITS$otus97.5 <- FUNS$otu_cluster(XITS$X)
X16S$otus97.5 <- FUNS$otu_cluster(X16S$X)

# Now we can generate count matrices for the otus rather than for ASVs

XITS$X_otu97.5            <- as.data.frame(XITS$X)
colnames(XITS$X_otu97.5)  <- as.character(XITS$otus97.5)
XITS$X_otu97.5            <- as.matrix(as.data.frame(lapply(split.default(XITS$X_otu97.5, f = XITS$otus97.5), rowSums)))

X16S$X_otu97.5            <- as.data.frame(X16S$X)
colnames(X16S$X_otu97.5)  <- as.character(X16S$otus97.5)
X16S$X_otu97.5            <- as.matrix(as.data.frame(lapply(split.default(X16S$X_otu97.5, f = X16S$otus97.5), rowSums)))

0.9.1 Step 10: show that technical duplicates are similar, and merge them. Check and remove outlier samples.

We will base our decisions about merging and removing on alpha and beta diversity measures, because they take into account to what extent samples are similar to each other. First we generate sublists within XITS and X16S that will contain alpha diversity for the clustered OTU’s. We will use these throughout the rest of the pipeline, but in this manner, we can add a very similar analysis for other OTU cutoffs if we wish to do so.

XITS$alpha_otu97.5 <- list()
X16S$alpha_otu97.5 <- list()

XITS$alpha_asv <- list()
X16S$alpha_asv <- list()

We will now look at most common alpha diversity measures, both for ITS and 16S, on the OTU level. The OTU’s are clustered at 97.5% identity, which is very stringent. I only do this to prevent sequencing errors or super rare variants from slipping in, influencing our results. In order to make sure we are not introducing too much bias, I first do two quick checks: are some diversity measures correlated with sample size (sequencing depth)? And, are some diversity measures influenced a lot by using the OTU’s rather than the ASV’s? For this quick check I use the alpha function from the microbiome package, since it unites most common alpha diversity measures.

XITS$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$X_otu97.5)),"total"=rowSums(XITS$X_otu97.5))
X16S$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$X_otu97.5)),"total"=rowSums(X16S$X_otu97.5))

XITS$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$X)),"total"=rowSums(XITS$X))
X16S$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$X)),"total"=rowSums(X16S$X))

0.9.1.1 Biases

Next, I visualize potential biases with scatterplots. For the correlations between sequencing depth and alpha diversity measures, I use a red colour in case the correlation is significant (p<0.05).

# Relationship between OTU's and ASV's for ITS:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
  plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_asv$alpha[,i],main=paste("ITS:",colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}

# Correlation with sequencing depth for ITS:
par(mfrow=c(6,4))

for (i in c(1:17,19:22))
{
  p      <- cor.test(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])$p.value
  r      <- cor(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])
  corcol <- ifelse(p<0.05,yes = "darkred", no ="black") 
  plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23],main=paste("ITS:", colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}

# Relationship between OTU's and ASV's for 16S:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
  plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_asv$alpha[,i],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}

# Correlation with sequencing depth for 16S:
par(mfrow=c(6,4))

for (i in c(1:17,19:22))
{
  p      <- cor.test(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])$p.value
  r      <- cor(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])
  corcol <- ifelse(p<0.05,yes = "darkred", no ="black") 
  plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}

I conclude that for most indices, OTU vs ASV does not really matter. “dominance_core_abundance” cannot be calculated for ASV’s. The highest correlations are obviously in richness/diversity, while dominance and especially rarity will be impacted a bit more. The correlation is weakest for evenness Camargo.

For what’s concerned correlation between indices and sample size: there may be problem, because there are typically significant correlations, though never strong. Nasrin: you looked into saturation plots. They should be added here.

0.9.1.2 Outliers:

I find little evidence for outliers at the level of ITS, but for 16S there are two samples that are clearly outliers: A5T and T3T This becomes clear with a quick pairs plot presenting a sample of alpha diversity measures, where A5T is colored red, and T3T orange. A5T and T3T must be dominated by a couple of very abundant OTU’s, while otherwise being composed of many rare taxa.

We will also find out that these samples have an extreme influence on ordination of beta diversity measures, and I think it would be best to exclude them altogether, also because they don’t have particularly interesting values for 137Cs.

cols <- rep("black",51)
cols[8] <- "red"       # A5T
cols[47] <- "orange"   # T3T
pairs(X16S$alpha_otu97.5$alpha[c(1,4,5,6,8,9,11,14,22)],col=cols,pch=15)

0.9.1.3 How similar are technical duplicates?

We will use some simple beta diversity based distance plots to show that technical replicates are very similar:

XITS$beta_asv <- list()
X16S$beta_asv <- list()

XITS$beta_otu97.5 <- list()
X16S$beta_otu97.5 <- list()

Bray Curtis distance based on the ASVs. Notice how the pairs of technical replicates are close to each other, but not necessarily always overlapping. Also notice how C5T and C6T, outliers on the level of readthroughs, seem to behave perfectly normal here.

XITS$beta_asv$BCdist <- vegan::vegdist(XITS$X, method = "bray")
FUNS$betaplot(XITS$beta_asv$BCdist,2,"ITS: NMDS based on Bray Curtis distance")
## Run 0 stress 0.2415223 
## Run 1 stress 0.2389663 
## ... New best solution
## ... Procrustes: rmse 0.100488  max resid 0.5750439 
## Run 2 stress 0.2379415 
## ... New best solution
## ... Procrustes: rmse 0.05456371  max resid 0.2427361 
## Run 3 stress 0.2379416 
## ... Procrustes: rmse 9.343905e-05  max resid 0.0004233384 
## ... Similar to previous best
## Run 4 stress 0.2379415 
## ... Procrustes: rmse 0.0001734942  max resid 0.0007668742 
## ... Similar to previous best
## Run 5 stress 0.2421108 
## Run 6 stress 0.2379415 
## ... Procrustes: rmse 1.574039e-05  max resid 9.149001e-05 
## ... Similar to previous best
## Run 7 stress 0.2382823 
## ... Procrustes: rmse 0.0483783  max resid 0.2352576 
## Run 8 stress 0.2379415 
## ... Procrustes: rmse 3.071133e-05  max resid 7.968857e-05 
## ... Similar to previous best
## Run 9 stress 0.2421101 
## Run 10 stress 0.2442064 
## Run 11 stress 0.2415224 
## Run 12 stress 0.2614215 
## Run 13 stress 0.2434778 
## Run 14 stress 0.2400955 
## Run 15 stress 0.2722964 
## Run 16 stress 0.238213 
## ... Procrustes: rmse 0.04154184  max resid 0.2394212 
## Run 17 stress 0.2685232 
## Run 18 stress 0.2389663 
## Run 19 stress 0.2415223 
## Run 20 stress 0.2400955 
## *** Best solution repeated 4 times

mean(XITS$beta_asv$BCdist)
## [1] 0.9592109
median(XITS$beta_asv$BCdist)
## [1] 0.9886755
as.matrix(XITS$beta_asv$BCdist)["A2T","A2T1"]
## [1] 0.432957
as.matrix(XITS$beta_asv$BCdist)["A3T","A3T1"]
## [1] 0.2553679
as.matrix(XITS$beta_asv$BCdist)["A4T","A4T1"]
## [1] 0.2558528
as.matrix(XITS$beta_asv$BCdist)["A6T","A6T1"]
## [1] 0.2464886
as.matrix(XITS$beta_asv$BCdist)["T10T","T10T1"]
## [1] 0.4866073
as.matrix(XITS$beta_asv$BCdist)["T12T","T12T1"]
## [1] 0.523499
as.matrix(XITS$beta_asv$BCdist)["O9T","O9T1"]
## [1] 0.3184464
as.matrix(XITS$beta_asv$BCdist)["FU3T","FU3T1"]
## [1] 0.4055995
# see also:

plot(XITS$mocks[1,],XITS$mocks[2,],log="xy")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 4796 x values <= 0 omitted
## from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 4795 y values <= 0 omitted
## from logarithmic plot

For 16S, it looks less nice. I’m sure the culprit is the outliers A5T and T3T (see further). Try running it again without them.

X16S$beta_asv$BCdist <- vegan::vegdist(X16S$X, method = "bray")
FUNS$betaplot(X16S$beta_asv$BCdist,2,"16S: NMDS based on Bray Curtis distance")
## Run 0 stress 0.2060517 
## Run 1 stress 0.1643125 
## ... New best solution
## ... Procrustes: rmse 0.09496938  max resid 0.4188215 
## Run 2 stress 0.1671694 
## Run 3 stress 0.2148475 
## Run 4 stress 0.1671446 
## Run 5 stress 0.165424 
## Run 6 stress 0.2139795 
## Run 7 stress 0.1711995 
## Run 8 stress 0.1914279 
## Run 9 stress 0.2088918 
## Run 10 stress 0.1723366 
## Run 11 stress 0.211506 
## Run 12 stress 0.1921776 
## Run 13 stress 0.1643056 
## ... New best solution
## ... Procrustes: rmse 0.001073852  max resid 0.005743202 
## ... Similar to previous best
## Run 14 stress 0.1712615 
## Run 15 stress 0.1721679 
## Run 16 stress 0.2175024 
## Run 17 stress 0.1712155 
## Run 18 stress 0.1723369 
## Run 19 stress 0.2070686 
## Run 20 stress 0.1723368 
## *** Best solution repeated 1 times

mean(X16S$beta_asv$BCdist)
## [1] 0.9351964
median(X16S$beta_asv$BCdist)
## [1] 0.9656546
as.matrix(X16S$beta_asv$BCdist)["A2T","A2T1"]
## [1] 0.612074
as.matrix(X16S$beta_asv$BCdist)["A3T","A3T1"]
## [1] 0.4490529
as.matrix(X16S$beta_asv$BCdist)["A4T","A4T1"]
## [1] 0.471351
as.matrix(X16S$beta_asv$BCdist)["A6T","A6T1"]
## [1] 0.5988588
as.matrix(X16S$beta_asv$BCdist)["T10T","T10T1"]
## [1] 0.4246175
as.matrix(X16S$beta_asv$BCdist)["T12T","T12T1"]
## [1] 0.6122163
as.matrix(X16S$beta_asv$BCdist)["O9T","O9T1"]
## [1] 0.5738448
as.matrix(X16S$beta_asv$BCdist)["FU3T","FU3T1"]
## [1] 0.4506166
# see also:

plot(X16S$mocks[1,],X16S$mocks[2,],log="xy")
## Warning in xy.coords(x, y, xlabel, ylabel, log): 9806 x values <= 0 omitted
## from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 9808 y values <= 0 omitted
## from logarithmic plot

# So let's now collapse/sum duplicates:
XITS$x      <- FUNS$collapse_duplicates(XITS$X)
X16S$x      <- FUNS$collapse_duplicates(X16S$X)

# Remove outliers:
X16S$x      <- X16S$x[!c(rownames(X16S$x) %in% c("A5T","T3T")),]

# Same for X_otu97.5
XITS$x_otu97.5     <- FUNS$collapse_duplicates(XITS$X_otu97.5)
X16S$x_otu97.5     <- FUNS$collapse_duplicates(X16S$X_otu97.5)
X16S$x_otu97.5     <- X16S$x_otu97.5[!c(rownames(X16S$x_otu97.5) %in% c("A5T","T3T")),]

Now we can move on to the proper alpha and beta diversity analyses!

0.10 Step 10 : alpha diversity

We start with richness, choosing the simple measure of the total number of taxa.

XITS$alpha_otu97.5 <- list()
X16S$alpha_otu97.5 <- list()

XITS$alpha_asv <- list()
X16S$alpha_asv <- list()
XITS$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$x_otu97.5)),"total"=rowSums(XITS$x_otu97.5))
X16S$alpha_otu97.5$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$x_otu97.5)),"total"=rowSums(X16S$x_otu97.5))

XITS$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(XITS$x)),"total"=rowSums(XITS$x))
X16S$alpha_asv$alpha <- cbind.data.frame(microbiome::alpha(t(X16S$x)),"total"=rowSums(X16S$x))

Next, once again, we visualize potential biases with scatterplots. For the correlations between sequencing depth and alpha diversity measures, I use a red colour in case the correlation is significant (p<0.05).

# Relationship between OTU's and ASV's for ITS:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
  plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_asv$alpha[,i],main=paste("ITS:",colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}

# Correlation with sequencing depth for ITS:
par(mfrow=c(6,4))

for (i in c(1:17,19:22))
{
  p      <- cor.test(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])$p.value
  r      <- cor(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23])
  corcol <- ifelse(p<0.05,yes = "darkred", no ="black") 
  plot(XITS$alpha_otu97.5$alpha[,i],XITS$alpha_otu97.5$alpha[,23],main=paste("ITS:", colnames(XITS$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}

# Relationship between OTU's and ASV's for 16S:
par(mfrow=c(6,4))
for (i in c(1:17,19:22))
{
  plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_asv$alpha[,i],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="OTU",ylab="ASV")
}

# Correlation with sequencing depth for 16S:
par(mfrow=c(6,4))

for (i in c(1:17,19:22))
{
  p      <- cor.test(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])$p.value
  r      <- cor(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23])
  corcol <- ifelse(p<0.05,yes = "darkred", no ="black") 
  plot(X16S$alpha_otu97.5$alpha[,i],X16S$alpha_otu97.5$alpha[,23],main=paste("16S:", colnames(X16S$alpha_otu97.5$alpha)[i]),xlab="alpha diversity estimate", ylab= "sequencing depth",col=corcol)
}

# start with classical richness estimates:
XITS$alpha_otu97.5$richness  <- microbiome::richness(t(XITS$x_otu97.5),index="observed")
X16S$alpha_otu97.5$richness  <- microbiome::richness(t(X16S$x_otu97.5),index="observed")
rich_all <- merge(XITS$alpha_otu97.5$richness,X16S$alpha_otu97.5$richness,by="row.names",suffixes=c("ITS","16S"))

# No meaningfull correlation between richness of ITS or 16S samples
pairs(rich_all[,-1])

par(mfrow=c(2,3))
plot(XITS$alpha_otu97.5$richness[,1] ~ log10(META[rownames(XITS$alpha_otu97.5$richness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$richness),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$richness)[1]),ylab=colnames(XITS$alpha_otu97.5$richness)[1],xlab="log10 137 Cs")
plot(X16S$alpha_otu97.5$richness[,1] ~ log10(META[rownames(X16S$alpha_otu97.5$richness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$richness),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$richness)[1]),ylab=colnames(X16S$alpha_otu97.5$richness)[1],xlab="log10 137 Cs")

 rm(rich_all)

0.10.1 Diversity

We move on to the five diversity measures from the microbiome diversity function. Zero counts are included here, but excluding them did not change any of the conclusions (it did not make a big difference in the measures anyhow). Some things I noticed:

  • In absolute numbers, there is a way higher diversity at the 16S level than at the level of ITS. Moreover, the range of diversity estimates is way higher in ITS than in 16S, because the 16S samples are typically all squished in values very close to maximal diversity.
  • There is a faint but significant correlation between diversity at the bacterial and fungal level for a couple of diversity measures. In other words: higher diversity in bacteria is correlated with higher diversity in fungi.
  • Plotting the diversity indices per group, as a function of 137Cs levels, does not reveal any spectacular trends.
  • There is one clear outlier for bacterial diversity: T3T , an otherwise unremarkable sample from Tsushima forest, has a very low diversity. We will later see this is caused by a dominance of one OTU.
# start with classical diversity estimates:
XITS$alpha_otu97.5$diversity  <- microbiome::diversity(t(XITS$x_otu97.5))
X16S$alpha_otu97.5$diversity  <- microbiome::diversity(t(X16S$x_otu97.5))
div_all         <- merge(XITS$alpha_otu97.5$diversity,X16S$alpha_otu97.5$diversity,by="row.names",suffixes=c("ITS","16S"))

# First diversity plots for ITS:

par(mfrow=c(2,3))
for(i in 1:5)
{
plot(XITS$alpha_otu97.5$diversity[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$diversity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$diversity),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$diversity)[i]),ylab=colnames(XITS$alpha_otu97.5$diversity)[i],xlab="log10 137 Cs")
}

# Same for bacteria
par(mfrow=c(2,3))

for(i in 1:5)
{
plot(X16S$alpha_otu97.5$diversity[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$diversity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$diversity),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$diversity)[i]),ylab=colnames(X16S$alpha_otu97.5$diversity)[i],xlab="log10 137 Cs")
}


pairs(div_all[,-1])

# Inverse Simpson: p-value = 0.001955
#cor.test(div_all[,2],div_all[,7])
# Gini Simpson: p-value = 0.7551
#cor.test(div_all[,3],div_all[,8])
# Shannon: p-value = 0.02814
#cor.test(div_all[,4],div_all[,9])
# Fisher: p-value = 0.844
#cor.test(div_all[,5],div_all[,10])
# Coverage: p-value = 0.001783
#cor.test(div_all[,6],div_all[,11])

rm(div_all)

0.10.2 Evenness

Similarly, very few trends. Only a super weak negative correlation at the level of Simpson evenness

# start with classical evenness estimates:
XITS$alpha_otu97.5$evenness  <- microbiome::evenness(t(XITS$x_otu97.5))
X16S$alpha_otu97.5$evenness  <- microbiome::evenness(t(X16S$x_otu97.5))
# First evenness plots for ITS:

eve_all <- merge(XITS$alpha_otu97.5$evenness,X16S$alpha_otu97.5$evenness,by="row.names",suffixes=c("ITS","16S"))


par(mfrow=c(2,3))
for(i in 1:5)
{
plot(XITS$alpha_otu97.5$evenness[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$evenness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$evenness),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$evenness)[i]),ylab=colnames(XITS$alpha_otu97.5$evenness)[i],xlab="log10 137 Cs")
}

# Same for bacteria
par(mfrow=c(2,3))

for(i in 1:5)
{
plot(X16S$alpha_otu97.5$evenness[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$evenness),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$evenness),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$evenness)[i]),ylab=colnames(X16S$alpha_otu97.5$evenness)[i],xlab="log10 137 Cs")
}


pairs(eve_all[,-1])

#cor.test(eve_all[,2],eve_all[,7])
#cor.test(eve_all[,3],eve_all[,8])
cor.test(eve_all[,4],eve_all[,9]) # Simpson evenness
## 
##  Pearson's product-moment correlation
## 
## data:  eve_all[, 4] and eve_all[, 9]
## t = -1.3485, df = 33, p-value = 0.1867
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5220266  0.1133414
## sample estimates:
##        cor 
## -0.2285372
#cor.test(eve_all[,5],eve_all[,10])
#cor.test(eve_all[,6],eve_all[,11])

rm(eve_all)

0.10.3 Dominance

  • Notice how for 16S, T3T is completely dominated by one OTU.
  • There are no significant correlations between dominance measures between ITS and 16S
# start with classical dominance estimates:
XITS$alpha_otu97.5$dominance  <- microbiome::dominance(t(XITS$x_otu97.5))[,-1] # exclude dbp, which is essentially the same as "relative dominance"
X16S$alpha_otu97.5$dominance  <- microbiome::dominance(t(X16S$x_otu97.5))[,-1]
dom_all <- merge(XITS$alpha_otu97.5$dominance,X16S$alpha_otu97.5$dominance,by="row.names",suffixes=c("ITS","16S"))

# First dominance plots for ITS:

par(mfrow=c(2,3))
for(i in 1:6)
{
plot(XITS$alpha_otu97.5$dominance[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$dominance),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$dominance),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$dominance)[i]),ylab=colnames(XITS$alpha_otu97.5$dominance)[i],xlab="log10 137 Cs")
}

# Same for bacteria
par(mfrow=c(2,3))
for(i in 1:6)
{
plot(X16S$alpha_otu97.5$dominance[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$dominance),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$dominance),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$dominance)[i]),ylab=colnames(X16S$alpha_otu97.5$dominance)[i],xlab="log10 137 Cs")
}

pairs(dom_all[,-1])

#cor.test(dom_all[,2],dom_all[,8])
#cor.test(dom_all[,3],dom_all[,9])
#cor.test(dom_all[,4],dom_all[,10])
#cor.test(dom_all[,5],dom_all[,11])
#cor.test(dom_all[,6],dom_all[,12])
#cor.test(dom_all[,7],dom_all[,13])

rm(dom_all)

0.10.4 Rarity

And finally rarity. No interesting trends or correlations here either.

# start with classical rarity estimates:
XITS$alpha_otu97.5$rarity  <- microbiome::rarity(t(XITS$x_otu97.5))
X16S$alpha_otu97.5$rarity  <- microbiome::rarity(t(X16S$x_otu97.5))
rar_all <- merge(XITS$alpha_otu97.5$rarity,X16S$alpha_otu97.5$rarity,by="row.names",suffixes=c("ITS","16S"))

par(mfrow=c(2,3))
for(i in 1:3)
{
plot(XITS$alpha_otu97.5$rarity[,i] ~ log10(META[rownames(XITS$alpha_otu97.5$rarity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(XITS$alpha_otu97.5$rarity),"site"]))],pch=16,cex=2,main=paste("ITS", colnames(XITS$alpha_otu97.5$rarity)[i]),ylab=colnames(XITS$alpha_otu97.5$rarity)[i],xlab="log10 137 Cs")
}

# Same for bacteria
for(i in 1:3)
{
plot(X16S$alpha_otu97.5$rarity[,i] ~ log10(META[rownames(X16S$alpha_otu97.5$rarity),"Cs_137"]),col=sitecolors[as.numeric(as.factor(META[rownames(X16S$alpha_otu97.5$rarity),"site"]))],pch=16,cex=2,main=paste("16S", colnames(X16S$alpha_otu97.5$rarity)[i]),ylab=colnames(X16S$alpha_otu97.5$rarity)[i],xlab="log10 137 Cs")
}

pairs(rar_all[,-1])

#cor.test(rar_all[,2],rar_all[,5])
#cor.test(rar_all[,3],rar_all[,6])
#cor.test(rar_all[,4],rar_all[,7])

rm(rar_all)

0.11 Step 11 : beta diversity

We will look into some classical beta diversity measures based on OTU’s, and next we will move on to unifrac-based distance methods, working with the ASV’s.

Visualisation of beta diversity will be with NMDS. Nasrin: first I visualized using MDS (PCoA), but going through literature I find more examples where they use nonmetric. Can you please double check which one is preferred and why?

0.11.1 Bray Curtis based on OTUS

First a Bray Curtis distance based on the OTUS. Notice that the plots are NMDS. They separate between sites extremely well for ITS. This needs more reflection because it’s super simple and by far the nicest beta diversity plot. We could also consider doing the same with a 95% identity cutoff for OTU’s, which should be entirely feasible with the functions defined above.

XITS$beta_otu97.5$BCdist <- vegan::vegdist(XITS$x_otu97.5, method = "bray")
FUNS$betaplot(XITS$beta_otu97.5$BCdist,2,"ITS: NMDS based on Bray Curtis distance")
## Run 0 stress 0.2195616 
## Run 1 stress 0.21654 
## ... New best solution
## ... Procrustes: rmse 0.05236158  max resid 0.1985932 
## Run 2 stress 0.2170456 
## Run 3 stress 0.2228401 
## Run 4 stress 0.2216314 
## Run 5 stress 0.2195616 
## Run 6 stress 0.2161677 
## ... New best solution
## ... Procrustes: rmse 0.01382243  max resid 0.05653742 
## Run 7 stress 0.2226497 
## Run 8 stress 0.2217721 
## Run 9 stress 0.2194505 
## Run 10 stress 0.2170722 
## Run 11 stress 0.2168354 
## Run 12 stress 0.2173403 
## Run 13 stress 0.2166804 
## Run 14 stress 0.2165399 
## ... Procrustes: rmse 0.01378591  max resid 0.05662984 
## Run 15 stress 0.2161854 
## ... Procrustes: rmse 0.005813229  max resid 0.02668189 
## Run 16 stress 0.2392967 
## Run 17 stress 0.2457943 
## Run 18 stress 0.2219012 
## Run 19 stress 0.2234187 
## Run 20 stress 0.2170724 
## Run 21 stress 0.2173403 
## Run 22 stress 0.2168353 
## Run 23 stress 0.222221 
## Run 24 stress 0.2214787 
## Run 25 stress 0.2229275 
## Run 26 stress 0.2168354 
## Run 27 stress 0.2438068 
## Run 28 stress 0.2173404 
## Run 29 stress 0.2215053 
## Run 30 stress 0.2166805 
## Run 31 stress 0.2537598 
## Run 32 stress 0.2217149 
## Run 33 stress 0.2168354 
## Run 34 stress 0.2167716 
## Run 35 stress 0.2168356 
## Run 36 stress 0.2165164 
## ... Procrustes: rmse 0.01289329  max resid 0.05773489 
## Run 37 stress 0.2165399 
## ... Procrustes: rmse 0.01377609  max resid 0.0566067 
## Run 38 stress 0.2225113 
## Run 39 stress 0.2218663 
## Run 40 stress 0.2213901 
## Run 41 stress 0.245587 
## Run 42 stress 0.2170726 
## Run 43 stress 0.2476566 
## Run 44 stress 0.221913 
## Run 45 stress 0.2218941 
## Run 46 stress 0.2216327 
## Run 47 stress 0.2168354 
## Run 48 stress 0.2165164 
## ... Procrustes: rmse 0.01290842  max resid 0.05778355 
## Run 49 stress 0.2216452 
## Run 50 stress 0.2168354 
## Run 51 stress 0.2218901 
## Run 52 stress 0.2168358 
## Run 53 stress 0.2165399 
## ... Procrustes: rmse 0.01378736  max resid 0.05664269 
## Run 54 stress 0.2170733 
## Run 55 stress 0.2170452 
## Run 56 stress 0.2217721 
## Run 57 stress 0.2195844 
## Run 58 stress 0.2165399 
## ... Procrustes: rmse 0.01378502  max resid 0.05664394 
## Run 59 stress 0.2168357 
## Run 60 stress 0.221922 
## Run 61 stress 0.2170721 
## Run 62 stress 0.2168354 
## Run 63 stress 0.2161851 
## ... Procrustes: rmse 0.005744833  max resid 0.02655988 
## Run 64 stress 0.2451963 
## Run 65 stress 0.2168354 
## Run 66 stress 0.2214647 
## Run 67 stress 0.2168356 
## Run 68 stress 0.2229272 
## Run 69 stress 0.2482602 
## Run 70 stress 0.2168354 
## Run 71 stress 0.2218902 
## Run 72 stress 0.2214787 
## Run 73 stress 0.2168355 
## Run 74 stress 0.2173403 
## Run 75 stress 0.2228191 
## Run 76 stress 0.2161851 
## ... Procrustes: rmse 0.005700706  max resid 0.02646405 
## Run 77 stress 0.2215941 
## Run 78 stress 0.2274349 
## Run 79 stress 0.2214786 
## Run 80 stress 0.2170732 
## Run 81 stress 0.2168487 
## Run 82 stress 0.2228843 
## Run 83 stress 0.219617 
## Run 84 stress 0.2187805 
## Run 85 stress 0.2229272 
## Run 86 stress 0.2218607 
## Run 87 stress 0.3956451 
## Run 88 stress 0.2229273 
## Run 89 stress 0.2173403 
## Run 90 stress 0.2174479 
## Run 91 stress 0.2161676 
## ... New best solution
## ... Procrustes: rmse 0.0001113134  max resid 0.0005116164 
## ... Similar to previous best
## *** Best solution repeated 1 times

For 16S, it looks less nice. I’m sure the culprit is the outliers A5T and T3T (see further). Try running it again without them.

X16S$beta_otu97.5$BCdist <- vegan::vegdist(X16S$x_otu97.5, method = "bray")
FUNS$betaplot(X16S$beta_otu97.5$BCdist,2,"16S: NMDS based on Bray Curtis distance")
## Run 0 stress 0.1511107 
## Run 1 stress 0.1925715 
## Run 2 stress 0.1511186 
## ... Procrustes: rmse 0.005389597  max resid 0.02129102 
## Run 3 stress 0.1511186 
## ... Procrustes: rmse 0.005390565  max resid 0.02127708 
## Run 4 stress 0.1951323 
## Run 5 stress 0.15109 
## ... New best solution
## ... Procrustes: rmse 0.002773529  max resid 0.01376874 
## Run 6 stress 0.15109 
## ... New best solution
## ... Procrustes: rmse 5.34065e-06  max resid 1.417754e-05 
## ... Similar to previous best
## Run 7 stress 0.15109 
## ... Procrustes: rmse 2.837567e-06  max resid 8.868053e-06 
## ... Similar to previous best
## Run 8 stress 0.1511186 
## ... Procrustes: rmse 0.004501855  max resid 0.02082956 
## Run 9 stress 0.194051 
## Run 10 stress 0.1809248 
## Run 11 stress 0.2310836 
## Run 12 stress 0.2010865 
## Run 13 stress 0.1874341 
## Run 14 stress 0.2097055 
## Run 15 stress 0.2010865 
## Run 16 stress 0.1511107 
## ... Procrustes: rmse 0.002775317  max resid 0.01378201 
## Run 17 stress 0.1511495 
## ... Procrustes: rmse 0.005117323  max resid 0.02039434 
## Run 18 stress 0.15109 
## ... Procrustes: rmse 3.22082e-06  max resid 1.606723e-05 
## ... Similar to previous best
## Run 19 stress 0.1809248 
## Run 20 stress 0.2010865 
## *** Best solution repeated 3 times

0.11.2 UniFrac based on ASV’s

Next, we generate distance matrices and neighbour joining trees for UniFrac. I have put the entire thing in one function, with two inputs: the dataset (XITS or X16S), and the k value. It calculates 3 kinds of distance matrix based on k-mer distance (euclidean, cosine distance and edgar). Next, it generates neighbour joining trees. With these trees and the untransformed count matrix, it calculates weighted and unweighted unifrac.

Caution: this function was still written for an object that would contain a single count table called “X”. It performs OK still, as long as the order of features (columns) is unchaged between “X” (the original count matrix) and “x” (the count matrix with merged samples and outliers removed)

XITS <- FUNS$generate_beta_asv(XITS, 5)
XITS <- FUNS$generate_beta_asv(XITS, 7)

X16S <- FUNS$generate_beta_asv(X16S, 5)
X16S <- FUNS$generate_beta_asv(X16S, 7)

0.11.3 Unifrac plots ITS

FUNS$betaplot(XITS$beta_asv$k5$cos$WUF,2,"ITS: NMDS based on weighted unifrac")
## Run 0 stress 0.1717321 
## Run 1 stress 0.1717321 
## ... New best solution
## ... Procrustes: rmse 0.001041446  max resid 0.005446282 
## ... Similar to previous best
## Run 2 stress 0.1752452 
## Run 3 stress 0.2441242 
## Run 4 stress 0.2342796 
## Run 5 stress 0.1717316 
## ... New best solution
## ... Procrustes: rmse 0.0006298701  max resid 0.003295059 
## ... Similar to previous best
## Run 6 stress 0.1717248 
## ... New best solution
## ... Procrustes: rmse 0.01084488  max resid 0.05236991 
## Run 7 stress 0.1717322 
## ... Procrustes: rmse 0.01087074  max resid 0.05213562 
## Run 8 stress 0.1717321 
## ... Procrustes: rmse 0.01083825  max resid 0.05261946 
## Run 9 stress 0.1717321 
## ... Procrustes: rmse 0.01086832  max resid 0.05215046 
## Run 10 stress 0.238092 
## Run 11 stress 0.1717244 
## ... New best solution
## ... Procrustes: rmse 0.001119744  max resid 0.005211145 
## ... Similar to previous best
## Run 12 stress 0.2542908 
## Run 13 stress 0.1717242 
## ... New best solution
## ... Procrustes: rmse 0.0003050361  max resid 0.001292722 
## ... Similar to previous best
## Run 14 stress 0.1991133 
## Run 15 stress 0.171724 
## ... New best solution
## ... Procrustes: rmse 0.0003126149  max resid 0.001581896 
## ... Similar to previous best
## Run 16 stress 0.1717244 
## ... Procrustes: rmse 0.000437372  max resid 0.0019257 
## ... Similar to previous best
## Run 17 stress 0.1717317 
## ... Procrustes: rmse 0.01062679  max resid 0.05050413 
## Run 18 stress 0.1717243 
## ... Procrustes: rmse 0.0004433173  max resid 0.002265181 
## ... Similar to previous best
## Run 19 stress 0.2066635 
## Run 20 stress 0.1717319 
## ... Procrustes: rmse 0.010646  max resid 0.05041645 
## *** Best solution repeated 3 times

FUNS$betaplot(XITS$beta_asv$k5$cos$UUF,2,"ITS: NMDS based on unweighted unifrac")
## Run 0 stress 0.2073058 
## Run 1 stress 0.2061372 
## ... New best solution
## ... Procrustes: rmse 0.03470784  max resid 0.1348326 
## Run 2 stress 0.21436 
## Run 3 stress 0.2061366 
## ... New best solution
## ... Procrustes: rmse 0.0003729544  max resid 0.001863726 
## ... Similar to previous best
## Run 4 stress 0.2065002 
## ... Procrustes: rmse 0.01405123  max resid 0.05717469 
## Run 5 stress 0.206137 
## ... Procrustes: rmse 0.0007016308  max resid 0.00353124 
## ... Similar to previous best
## Run 6 stress 0.2070166 
## Run 7 stress 0.2338645 
## Run 8 stress 0.2061371 
## ... Procrustes: rmse 0.0003158124  max resid 0.001569558 
## ... Similar to previous best
## Run 9 stress 0.2338459 
## Run 10 stress 0.2581449 
## Run 11 stress 0.2338351 
## Run 12 stress 0.2061377 
## ... Procrustes: rmse 0.001014854  max resid 0.005110311 
## ... Similar to previous best
## Run 13 stress 0.2388458 
## Run 14 stress 0.2065005 
## ... Procrustes: rmse 0.0141096  max resid 0.0572858 
## Run 15 stress 0.2061375 
## ... Procrustes: rmse 0.0005101487  max resid 0.002560743 
## ... Similar to previous best
## Run 16 stress 0.2061369 
## ... Procrustes: rmse 0.0002474042  max resid 0.001245933 
## ... Similar to previous best
## Run 17 stress 0.2399048 
## Run 18 stress 0.2064996 
## ... Procrustes: rmse 0.01366996  max resid 0.05644463 
## Run 19 stress 0.2338631 
## Run 20 stress 0.2063334 
## ... Procrustes: rmse 0.00496518  max resid 0.02146803 
## *** Best solution repeated 6 times

0.11.4 Unifrac plots 16S

FUNS$betaplot(X16S$beta_asv$k5$cos$WUF,2,"16S: NMDS based on weighted unifrac")
## Run 0 stress 0.1411075 
## Run 1 stress 0.1411075 
## ... Procrustes: rmse 3.01245e-05  max resid 0.000101905 
## ... Similar to previous best
## Run 2 stress 0.2030164 
## Run 3 stress 0.1411075 
## ... New best solution
## ... Procrustes: rmse 3.635883e-06  max resid 1.300527e-05 
## ... Similar to previous best
## Run 4 stress 0.1448017 
## Run 5 stress 0.1411075 
## ... Procrustes: rmse 8.207412e-06  max resid 4.288476e-05 
## ... Similar to previous best
## Run 6 stress 0.2030164 
## Run 7 stress 0.2034605 
## Run 8 stress 0.1411075 
## ... Procrustes: rmse 1.373261e-05  max resid 4.121247e-05 
## ... Similar to previous best
## Run 9 stress 0.191438 
## Run 10 stress 0.1454371 
## Run 11 stress 0.2172065 
## Run 12 stress 0.238678 
## Run 13 stress 0.1447925 
## Run 14 stress 0.1411075 
## ... Procrustes: rmse 2.009719e-05  max resid 5.942629e-05 
## ... Similar to previous best
## Run 15 stress 0.1411075 
## ... Procrustes: rmse 4.34741e-06  max resid 1.274688e-05 
## ... Similar to previous best
## Run 16 stress 0.2186722 
## Run 17 stress 0.1454371 
## Run 18 stress 0.2186705 
## Run 19 stress 0.1411075 
## ... Procrustes: rmse 6.099202e-06  max resid 1.808999e-05 
## ... Similar to previous best
## Run 20 stress 0.1454371 
## *** Best solution repeated 6 times

FUNS$betaplot(X16S$beta_asv$k5$cos$UUF,2,"16S: NMDS based on unweighted unifrac")
## Run 0 stress 0.1384554 
## Run 1 stress 0.1581625 
## Run 2 stress 0.1709945 
## Run 3 stress 0.1581617 
## Run 4 stress 0.1403704 
## Run 5 stress 0.1581617 
## Run 6 stress 0.1581617 
## Run 7 stress 0.2166407 
## Run 8 stress 0.1928733 
## Run 9 stress 0.1637117 
## Run 10 stress 0.1384987 
## ... Procrustes: rmse 0.01945306  max resid 0.0572303 
## Run 11 stress 0.1404465 
## Run 12 stress 0.1637118 
## Run 13 stress 0.1926978 
## Run 14 stress 0.1637118 
## Run 15 stress 0.1922051 
## Run 16 stress 0.1403703 
## Run 17 stress 0.1385792 
## ... Procrustes: rmse 0.01974085  max resid 0.05701214 
## Run 18 stress 0.2156991 
## Run 19 stress 0.2248407 
## Run 20 stress 0.1581617 
## Run 21 stress 0.1709944 
## Run 22 stress 0.1384987 
## ... Procrustes: rmse 0.01949274  max resid 0.05729367 
## Run 23 stress 0.1637118 
## Run 24 stress 0.1637118 
## Run 25 stress 0.1384554 
## ... New best solution
## ... Procrustes: rmse 0.0002098508  max resid 0.001129049 
## ... Similar to previous best
## *** Best solution repeated 1 times

It becomes immediately clear that the 16S unifrac looks very messy. This is all due to the outliers (see earlier) and I’m pretty sure we should repeat the calculations without these datapoints. I was first thinking just removing the two datapoints from the unifrac outcomes, but NJ trees are used to infer evolutionary relationships based on distance data. Outliers can disproportionately influence the tree structure by creating long branches that can attract or repel other branches (long branch attraction). So the outliers would not only affect their own placement in the tree but also alter the topology of the tree regarding other samples. I guess this could distort the inferred relationships among the more typical samples.

I will recalculate the whole thing ditching the two outliers.

Note that these plots are just a selection for illustration for now. You can generate many more plots and they should be explored: * NMDS with more than 2 dimensions (but keep on checking the stress) * You can consider going back to PCoA (see earlier remarks) * Unifrac is based on trees, which are constructed in different manners (different k values and different distance metrics). You can go through them.

0.12 Step 12: Taxonomy

And finally the taxonomy part. Nothing much to say here for now, because it still needs closer inspection, but everything we need should be there. First, we assign taxonomy both for ITS and 16S:

### Assign taxonomy: 

XITS$taxa <- assignTaxonomy(colnames(XITS$x), "~/db/sh_general_release_dynamic_s_all_04.04.2024.fasta",verbose=T,multithread=20)
X16S$taxa <- assignTaxonomy(colnames(X16S$x), "~/db/silva_nr99_v138.1_wSpecies_train_set.fa.gz",verbose=T,multithread=20)

And some barplots on the class level as an illustration:

FUNS$barplot_function(XITS$x,XITS$taxa,"Class",samples=c(1:35),other=12)

FUNS$barplot_function(X16S$x,X16S$taxa,"Class",samples=c(1:39),other=12)

knitr::knit_exit()